save pandas df into several different CSV files - python

This is some code that will generate some random time series data. Ultimately I am trying to save each day's data into a separate CSV file...
import pandas as pd
import numpy as np
from numpy.random import randint
np.random.seed(10) # added for reproducibility
rng = pd.date_range('10/9/2018 00:00', periods=1000, freq='1H')
df = pd.DataFrame({'Random_Number':randint(1, 100, 1000)}, index=rng)
I can print each day's data with this:
for idx, days in df.groupby(df.index.date):
    print(days)
But how could I incorporate saving individual CSV files into a directory csv, with each file named using the month and day of its first timestamp? (The code below does not work.)
for idx, days in df.groupby(df.index.date):
    for day in days:
        df2 = pd.DataFrame(list(day))
        month_num = df.index.month[0]
        day_num = df.index.day[0]
        df2.to_csv('/csv/' + f'{month_num}' + '_' + f'{day_num}' + '.csv')

You could iterate over all available days, filter your dataframe and then save.
# iterate over all available days
for date in set(df.index.date):
    # filter your dataframe
    filtered_df = df.loc[df.index.date == date].copy()
    # save it
    filename = date.strftime('%m_%d') # filename represented as 'month_day'
    filtered_df.to_csv(f"./csv/{filename}.csv")
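Equivalently, you can reuse the groupby from the question and let pandas hand you each day's rows directly; a minimal sketch (assuming the target directory ./csv already exists):
for idx, days in df.groupby(df.index.date):
    # idx is a datetime.date, so it can name the file directly
    days.to_csv(f"./csv/{idx.strftime('%m_%d')}.csv")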

Related

Converting Unix time into datetime within CSV

I have a CSV file with many lines and three columns. The first column is the Unix time, the second the price, and the third the volume of the symbol traded at that specific price. What I'm doing is calculating OHLC for different time frames (e.g. 1h, 4h, 12h, 1d) out of that CSV file. That works very well after first converting the Unix time into datetime.
Code:
import pandas as pd
df = pd.read_csv('file.csv', names=['date', 'price', 'volume'])
df['date'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('date')
df = df['price'].resample('4h').ohlc()
df.to_csv('file_4h_ohlc.csv')
Result:
date,open,high,low,close
2017-05-01 20:00:00,0.757881,1.07,0.650011,1.069999
Target:
I now want to convert the datetime (2017-05-01 20:00:00) back to Unix time (1493658000) within the same file while keeping the OHLC values, or, if that's not possible, save it to a different file.
Thanks a lot for the support, and sorry if this question has already been answered, but I didn't find it.
-hotshot
You can create a new datestamp column instead of overwriting the existing date column, use it as the index for resampling, and then convert the resampled index back to Unix time.
import pandas as pd
df = pd.read_csv('file.csv', names=['date', 'price', 'volume'])
df['datestamp'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('datestamp')
df = df['price'].resample('4h').ohlc()
# Resampling keeps only the price aggregates, so convert the
# resampled DatetimeIndex back to Unix seconds and name it 'date'
df.index = (df.index - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
df.index.name = 'date'
df.to_csv('file_4h_ohlc.csv')
Alternatively, you can convert an existing datetime column to Unix timestamps like so:
import datetime
df['date'].apply(lambda x: (x - datetime.datetime(1970, 1, 1)).total_seconds())
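A vectorized equivalent (a sketch, assuming df['date'] already holds pandas datetimes) avoids the per-row apply:
(df['date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')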

In a .CSV file, for each location (NAME), calculate the average snow amount per month and then save the results in two separate .csv files in Python?

For each NAME in filteredData.csv, calculate the average snow amount per month. Save the results in two separate .csv files (one for 2016 and the other for 2017), naming the files average2016.csv and average2017.csv.
I am using Python 3.8 with pandas. I have tried:
df = pd.read_csv('filteredData.csv')
g = df.groupby([df.DATE.dt.year, df.DATE.dt.month, 'NAME'])['SNOW'].mean().reset_index().sort_values()
df_2016 = df.loc[df.DATE.dt.year == 2016]
df_2016.to_csv('average2016.csv', index=False)
df_2017 = df.loc[df.DATE.dt.year == 2017]
df_2017.to_csv('average2017.csv', index=False)
But all I get is errors from this. I am not sure where to start.
This is a small part of the filteredData.csv
Your date field initially has the datatype object, so you need to convert it before calling datetime functions on it. I simplified the groupby to group by month after breaking the data set into two dataframes, one for each year.
import pandas as pd
df = pd.read_csv('filteredData.csv')
df['DATE'] = pd.to_datetime(df['DATE'])
df['year'] = df['DATE'].dt.year
df['month'] = df['DATE'].dt.month
df16 = df[(df.year == 2016)]
df17 = df[(df.year == 2017)]
# group each year's frame by its own month column, not the full frame's
df_2016 = df16.groupby('month')['SNOW'].mean()
df_2017 = df17.groupby('month')['SNOW'].mean()
# keep the index so the month shows up as a column in the output files
df_2016.to_csv('average2016.csv')
df_2017.to_csv('average2017.csv')
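If you also need the per-location breakdown the task asks for, you can group by NAME as well; a sketch building on the frames above (assuming the columns are named NAME and SNOW as in the question):
df_2016 = df16.groupby(['NAME', 'month'])['SNOW'].mean().reset_index()
df_2017 = df17.groupby(['NAME', 'month'])['SNOW'].mean().reset_index()
df_2016.to_csv('average2016.csv', index=False)
df_2017.to_csv('average2017.csv', index=False)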

Sort by date with Excel file and Pandas

I am trying to sort my Excel file by the date column. When the code runs it converts the cells from text strings to datetimes and it sorts, but only within the same month. That is, when I have dates from both October and September, each month is sorted separately.
I have been all over Google and YouTube.
import pandas as pd
import datetime
from datetime import timedelta
x = datetime.datetime.now()
excel_workbook = 'data.xlsx'
sheet1 = pd.read_excel(excel_workbook, sheet_name='RAW DATA')
sheet1['Call_DateTime'] = pd.to_datetime(sheet1['Call_DateTime'])
sheet1.sort_values(sheet1['Call_DateTime'], axis=1, ascending=True, inplace=True)
sheet1['SegmentDuration'] = pd.to_timedelta(sheet1['SegmentDuration'], unit='s')
sheet1['SegmentDuration'] = timedelta(hours=0.222)
sheet1.style.apply('h:mm:ss', column=['SegmentDuration'])
sheet1.to_excel("S4x Output"+x.strftime("%m-%d")+".xlsx", index = False)
print("All Set!!")
I would like it to sort oldest to newest.
Update: I changed the sort_values call and this works.
import pandas as pd
import datetime
from datetime import timedelta
x = datetime.datetime.now()
excel_workbook = 'data.xlsx'
sheet1 = pd.read_excel(excel_workbook, sheet_name='RAW DATA')
sheet1['Call_DateTime'] = pd.to_datetime(sheet1['Call_DateTime'])
sheet1.sort_values(['Call_DateTime'], axis=0, ascending=True, inplace=True)
sheet1['SegmentDuration'] = pd.to_timedelta(sheet1['SegmentDuration'], unit='s')
sheet1['SegmentDuration'] = timedelta(hours=0.222)
sheet1.style.apply('h:mm:ss', column=['SegmentDuration'])
sheet1.to_excel("S4x Output"+x.strftime("%m-%d")+".xlsx", index = False)
print("All Set!!")

pandas - get a dataframe for every day

I have a DataFrame with dates in the index. I make a subset of the DataFrame for every day. Is there any way to write a function or a loop to generate these steps automatically?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize
import datetime as dt
#Get the channel feeds from ThingSpeak
response = requests.get("https://api.thingspeak.com/channels/518038/feeds.json?api_key=XXXXXX&results=500")
#Convert Json object to Python object
response_data = response.json()
channel_head = response_data["channel"]
channel_bottom = response_data["feeds"]
#Create DataFrame with Pandas
df = pd.DataFrame(channel_bottom)
#Rename parameters
df = df.rename(columns={"field1":"PM 2.5","field2":"PM 10"})
#Drop all entries with at least one NaN
df = df.dropna(how="any")
#Convert time to datetime object
df["created_at"] = df["created_at"].apply(lambda x:dt.datetime.strptime(x,"%Y-%m-%dT%H:%M:%SZ"))
#Set dates as Index
df = df.set_index(keys="created_at")
#Make a DataFrame for every day
df_2018_12_07 = df.loc['2018-12-07']
df_2018_12_06 = df.loc['2018-12-06']
df_2018_12_05 = df.loc['2018-12-05']
df_2018_12_04 = df.loc['2018-12-04']
df_2018_12_03 = df.loc['2018-12-03']
df_2018_12_02 = df.loc['2018-12-02']
Supposing that you run this on the first day of the next week (so, exporting Monday through Sunday on the following Monday), you can do it as follows:
from datetime import date, timedelta
today = date.today()
day = today - timedelta(days=7) # so, if today is Monday, we start from the Monday before
while day < today:
    df1 = df.loc[str(day)]
    df1.to_csv('mypath'+str(day)+'.csv') #so that export files have different names
    day = day + timedelta(days=1)
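One caveat (an addition, not in the original answer): df.loc[str(day)] raises a KeyError for a day with no rows, so you may want to guard the export:
while day < today:
    try:
        df.loc[str(day)].to_csv('mypath'+str(day)+'.csv')
    except KeyError:
        pass # no rows for this day, skip it
    day = day + timedelta(days=1)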
Alternatively, to export only the current day's data, you can use:
from datetime import date
today = str(date.today())
df = df.loc[today]
and schedule the script using any scheduler such as crontab.
You can create a dictionary of DataFrames and then select each day's DataFrame by its key:
dfs = dict(tuple(df.groupby(df.index.strftime('%Y-%m-%d'))))
print (dfs['2018-12-07'])
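If you then want one file per day out of that dictionary, a minimal sketch:
for day, df_day in dfs.items():
    df_day.to_csv(day + '.csv') # keys are 'YYYY-MM-DD' strings, so each file gets a distinct name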

Pivot pandas timeseries by year

Is there a shorter or more elegant way to pivot a timeseries by year in pandas? The code below does what I want, but I wonder if there is a better way to accomplish this:
import pandas
import numpy
daterange = pandas.date_range(start='2000-01-01', end='2017-12-31', freq='10T')
# generate a fake timeseries of measured wind speeds from 2000 to 2017 in 10min intervals
wind_speed = pandas.Series(data=numpy.random.rand(daterange.size), index=daterange)
# group by year
wind_speed_groups = wind_speed.groupby(wind_speed.index.year).groups
# assemble data frame with columns of wind speed data for every year
wind_speed_pivot = pandas.DataFrame()
for key, group in wind_speed_groups.items():
    series = wind_speed[group]
    series.name = key
    series.index = series.index - pandas.Timestamp(str(key)+'-01-01')
    wind_speed_pivot = wind_speed_pivot.join(series, how='outer')
print(wind_speed_pivot)
I'm not sure if this is the fastest method, as I'm adding two columns to your initial dataframe (it's possible to add just one if you want to overwrite it).
import pandas as pd
import numpy as np
import datetime as dt
daterange = pd.date_range(start='2000-01-01', end='2017-12-31', freq='10T')
# generate a fake timeseries of measured wind speeds from 2000 to 2017 in 10min intervals
wind_speed = pd.Series(data=np.random.rand(daterange.size), index=daterange)
df = wind_speed.to_frame("windspeed")
df["year"] = df.index.year
df["pv_index"] = df.index - df["year"].apply(lambda x: dt.datetime(x,1,1))
wind_speed_pivot = df.pivot_table(index=["pv_index"], columns=["year"], values=["windspeed"])
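Since values=["windspeed"] gives the result a two-level column index, you may want to flatten it before use; a small follow-up sketch:
# drop the outer 'windspeed' level so the columns are just the years
wind_speed_pivot.columns = wind_speed_pivot.columns.droplevel(0)
print(wind_speed_pivot.head())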
