I have a problem with automatically importing a CSV file and creating a pandas DataFrame. Here is the code I've got:
from datetime import time
from datetime import date
from datetime import datetime
import os
import fnmatch
import pandas as pd

def get_local_file(pdate, hour, path='/apps/dev_data/data/'):
    """Get date+hour processing file from local drive

    :param pdate: str Processing date
    :param hour: str Processing hour
    :param path: str Path to file location
    :return: str Full path of the matching file
    """
    sdate = pdate + '-' + str(hour)
    for p_file in os.listdir(path):
        if fnmatch.fnmatch(p_file, 'RSRAN098_IP_R*' + sdate + '*.csv'):
            return path + p_file

def get_files(pdate, path='/apps/dev_data/data/'):
    hours = [time(i).strftime('%H') for i in range(24)]
    fileList = []
    for hour in hours:
        fileList.append(get_local_file(pdate, hour))
    return fileList

processing_date = datetime.strptime('20170614', '%Y%m%d').date()
a = get_files(str(processing_date).replace('-', '_'))
print(a)

frame = pd.DataFrame()
list_ = []
for file_ in a:
    df = pd.read_csv(file_, index_col=None, header=0, delimiter=';')
    list_.append(df)
frame = pd.concat(list_)
The only problem is that I have a fixed date; I can't find a way to use the current date instead.
You can get the current date with the datetime module.
Replace this
processing_date = datetime.strptime('20170614', '%Y%m%d').date()
with something like datetime.now().date() (the code already does from datetime import datetime, so no datetime.datetime prefix is needed).
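For instance, a minimal sketch that only swaps the fixed date for today's date while keeping the 'YYYY_MM_DD' string format the file matcher expects:

processing_date = datetime.now().date()  # today instead of the fixed '20170614'
a = get_files(str(processing_date).replace('-', '_'))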
But I think maybe I don't get your point, because the answer seems too straightforward.
This is some code that will generate some random time series data. Ultimately I am trying to save each day's data into separate CSV files...
import pandas as pd
import numpy as np
from numpy.random import randint
np.random.seed(10)  # added for reproducibility
rng = pd.date_range('10/9/2018 00:00', periods=1000, freq='1H')
df = pd.DataFrame({'Random_Number':randint(1, 100, 1000)}, index=rng)
I can print each day's data with this:
for idx, days in df.groupby(df.index.date):
    print(days)
But how could I incorporate saving the individual CSV files into a directory csv, with each file named after the month & day of its first timestamp entry? (The code below does not work.)
for idx, days in df.groupby(df.index.date):
    for day in days:
        df2 = pd.DataFrame(list(day))
        month_num = df.index.month[0]
        day_num = df.index.day[0]
        df2.to_csv('/csv/' + f'{month_num}' + '_' + f'{day_num}' + '.csv')
You could iterate over all available days, filter your dataframe and then save.
# iterate over all available days
for date in set(df.index.date):
    # filter your dataframe
    filtered_df = df.loc[df.index.date == date].copy()
    # save it
    filename = date.strftime('%m_%d')  # filename represented as 'month_day'
    filtered_df.to_csv(f"./csv/{filename}.csv")
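Alternatively, a short sketch reusing the groupby from the question; days.index[0] is the first timestamp within each day, which matches the requested naming:

for idx, days in df.groupby(df.index.date):
    days.to_csv(f"./csv/{days.index[0].strftime('%m_%d')}.csv")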
I am trying to sort my Excel file by the date column. When the code runs it converts the cells from a text string to a datetime and it sorts, but only within the same month. That is, when I have dates from October and September it only orders them within each month instead of chronologically overall.
I have been all over Google and YouTube.
import pandas as pd
import datetime
from datetime import timedelta
x = datetime.datetime.now()
excel_workbook = 'data.xlsx'
sheet1 = pd.read_excel(excel_workbook, sheet_name='RAW DATA')
sheet1['Call_DateTime'] = pd.to_datetime(sheet1['Call_DateTime'])
sheet1.sort_values(sheet1['Call_DateTime'], axis=1, ascending=True, inplace=True)
sheet1['SegmentDuration'] = pd.to_timedelta(sheet1['SegmentDuration'], unit='s')
sheet1['SegmentDuration'] = timedelta(hours=0.222)
sheet1.style.apply('h:mm:ss', column=['SegmentDuration'])
sheet1.to_excel("S4x Output"+x.strftime("%m-%d")+".xlsx", index = False)
print("All Set!!")
I would like it to sort oldest to newest.
Update: I changed the code as below and it works; the fix was to pass the column name as a list (['Call_DateTime']) and sort along rows (axis=0) rather than passing the column's values with axis=1.
import pandas as pd
import datetime
from datetime import timedelta
x = datetime.datetime.now()
excel_workbook = 'data.xlsx'
sheet1 = pd.read_excel(excel_workbook, sheet_name='RAW DATA')
sheet1['Call_DateTime'] = pd.to_datetime(sheet1['Call_DateTime'])
sheet1.sort_values(['Call_DateTime'], axis=0, ascending=True, inplace=True)
sheet1['SegmentDuration'] = pd.to_timedelta(sheet1['SegmentDuration'], unit='s')
sheet1['SegmentDuration'] = timedelta(hours=0.222)
sheet1.style.apply('h:mm:ss', column=['SegmentDuration'])
sheet1.to_excel("S4x Output"+x.strftime("%m-%d")+".xlsx", index = False)
print("All Set!!")
I need to read one column from multiple CSV files present in a folder and then extract the minimum and maximum dates from that column.
For example, if I have the folder path "/usr/abc/xyz/" and multiple CSV files are present as below:
aaa.csv
bbb.csv
ccc.csv
The files contain the following data.
aaa.csv contains:
name,address,dates
xxx,11111,20190101
yyy,22222,20190201
zzz,33333,20190101
bbb.csv contains:
name,address,dates
fff,11111,20190301
ggg,22222,20190501
hhh,33333,20190601
So I need to extract the minimum and maximum dates from the files; in the above case the date range should be 20190101 to 20190601.
Can anyone please help with how I can extract the minimum and maximum dates from the files in Python?
I need to avoid pandas or any other package, as I need to read the CSV files directly in Python.
import pandas as pd
dt = pd.read_csv('your_csv.csv')
print(max(dt['dates']))
print(min(dt['dates']))
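Since the question involves several files in one folder, a minimal sketch extending this to every CSV there (assuming the folder path and the dates column name from the question):

import glob
import os
import pandas as pd

folder = '/usr/abc/xyz/'
frames = [pd.read_csv(f, usecols=['dates']) for f in glob.glob(os.path.join(folder, '*.csv'))]
all_dates = pd.concat(frames)['dates']
print(all_dates.max())
print(all_dates.min())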
If you need to avoid pandas, you can do the following (which is not recommended at all):
dt = []
with open('your_csv.csv', 'r') as f:
    data = f.readlines()
    for row in data:
        dt.append(row.split(',')[2].rstrip())
dt.pop(0)  # drop the header value
print(max(dt))
print(min(dt))
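A side note: a plain row.split(',') breaks as soon as a field contains a quoted comma; the standard-library csv module handles that, and it also makes covering every file in the folder easy. A sketch, still with no third-party packages (assuming the folder path from the question; string min/max works here because the dates are zero-padded YYYYMMDD):

import csv
import os

folder = '/usr/abc/xyz/'
dates = []
for name in os.listdir(folder):
    if name.endswith('.csv'):
        with open(os.path.join(folder, name), newline='') as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            dates.extend(row[2] for row in reader)
print(max(dates))
print(min(dates))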
A solution using only the standard library. It doesn't read the whole file into memory, so it should have a very low footprint and will work with larger files.
pathlib is used to get all the csv files
datetime is used to convert to dates
sys is used for user input
$ python3 date_min_max.py /usr/abc/xyz/
min date: 2019-01-01 00:00:00
max date: 2019-06-01 00:00:00
date_min_max.py
from pathlib import Path
from datetime import datetime
import sys

if len(sys.argv) > 1:
    p = sys.argv[1]
else:
    p = "."

files = [x for x in Path(p).iterdir() if x.suffix == ".csv"]
date_format = "%Y%m%d"
dt_max = datetime.strptime("19000101", date_format)
dt_min = datetime.strptime("30000101", date_format)

for file in files:
    with file.open("r") as fh:
        for i, line in enumerate(fh):
            if i == 0:
                continue
            t = line.strip().split(",")[2]
            dt_max = max(dt_max, datetime.strptime(t, date_format))
            dt_min = min(dt_min, datetime.strptime(t, date_format))

print("min date: {}\nmax date: {}".format(dt_min, dt_max))
I have saved all the daily sales reports in a common folder. Each file is named with the corresponding date, e.g. 01-01-2019-Sales.csv, 02-01-2019-Sales.csv, etc. All the files are saved in the "C:\Desktop\Sales" folder path. Now I want to extract & combine all the files which are dated between 05-01-2019 and 04-02-2019.
I know I can extract all the files with pandas using the below code
import pandas as pd
import glob
import os
file_path = r'C:\Desktop\Sales'
all_files = glob.glob(os.path.join(file_path,'*.csv'))
df = pd.concat([pd.read_csv(f) for f in all_files], sort=False)
But my question is: how can I extract the files between two given dates using pandas/python (using the file names, which include the date)? E.g. extract only the files between 05-01-2019 and 04-02-2019.
What about this:
start_date = "05-01-2019"
end_date = "04-02-2019"
all_csv_files = [x for x in os.listdir(file_path) if x.endswith('.csv')]
# note: this compares file names as strings, so it is only reliable while
# the lexical order of the 'DD-MM-YYYY-Sales.csv' names matches the date order
correct_date_files = [x for x in all_csv_files
                      if start_date + "-Sales.csv" <= x <= end_date + "-Sales.csv"]
df = pd.concat([pd.read_csv(os.path.join(file_path, f)) for f in correct_date_files],
               sort=False)
You basically just list all .csv files in your directory and only take the ones between the chosen dates.
I think that this piece of code will help you
import datetime
d1 = datetime.date(2019,1,1)
d2 = datetime.date(2019,2,1)
d3 = datetime.date(2019,1,20)
d4 = datetime.date(2019,2,20)
print(d1<d3<d2)
# True
print(d1<d4<d2)
# False
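Applied to the file names from the question, a sketch (assuming the DD-MM-YYYY-Sales.csv naming, so 05-01-2019 means 5 January, and reusing file_path from the question):

import datetime
import glob
import os

start = datetime.date(2019, 1, 5)
end = datetime.date(2019, 2, 4)
for path in glob.glob(os.path.join(file_path, '*-Sales.csv')):
    d, m, y = os.path.basename(path).split('-')[:3]
    if start <= datetime.date(int(y), int(m), int(d)) <= end:
        print(path)  # or collect these paths for pd.concat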
The dates could be compared lexically with a change to yyyy-mm-dd.
>>> L = ['01-01-2019-Sales.csv', '02-01-2019-Sales.csv']
>>> start = '2018-12-01'
>>> end = '2019-02-01'
>>> for file in L:
...     m, d, yr = file.split('-')[:3]
...     date = '-'.join([yr, m, d])
...     if start <= date <= end:
...         print(file)
...
01-01-2019-Sales.csv
02-01-2019-Sales.csv
Use the dates as comparison:
import pandas as pd
import glob
import os
from time import strptime
file_path = r'C:\Desktop\Sales'
all_files = glob.glob(os.path.join(file_path,'*.csv'))
start_date = strptime('05-01-2019', '%d-%m-%Y')
end_date = strptime('04-02-2019', '%d-%m-%Y')
df = pd.concat([pd.read_csv(f) for f in all_files
                if start_date <= strptime(os.path.basename(f), '%d-%m-%Y-Sales.csv') <= end_date],
               sort=False)
I have a DataFrame with dates in the index. I make a subset of the DataFrame for every day. Is there any way to write a function or a loop to generate these steps automatically?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize
import datetime as dt
#Get the channel feeds from ThingSpeak
response = requests.get("https://api.thingspeak.com/channels/518038/feeds.json?api_key=XXXXXX&results=500")
#Convert Json object to Python object
response_data = response.json()
channel_head = response_data["channel"]
channel_bottom = response_data["feeds"]
#Create DataFrame with Pandas
df = pd.DataFrame(channel_bottom)
#rename Parameters
df = df.rename(columns={"field1":"PM 2.5","field2":"PM 10"})
#Drop all entries with at least one NaN
df = df.dropna(how="any")
#Convert time to datetime object
df["created_at"] = df["created_at"].apply(lambda x:dt.datetime.strptime(x,"%Y-%m-%dT%H:%M:%SZ"))
#Set dates as Index
df = df.set_index(keys="created_at")
#Make a DataFrame for every day
df_2018_12_07 = df.loc['2018-12-07']
df_2018_12_06 = df.loc['2018-12-06']
df_2018_12_05 = df.loc['2018-12-05']
df_2018_12_04 = df.loc['2018-12-04']
df_2018_12_03 = df.loc['2018-12-03']
df_2018_12_02 = df.loc['2018-12-02']
Supposing that you do that on the first day of the next week (so, exporting Monday to Sunday on the following Monday), you can do it as follows:
from datetime import date, timedelta

today = date.today()
day = today - timedelta(days=7)  # so, if today is Monday, we start at the Monday before
while day < today:
    df1 = df.loc[str(day)]
    df1.to_csv('mypath' + str(day) + '.csv')  # so that exported files have different names
    day = day + timedelta(days=1)
You can use:
from datetime import date
today = str(date.today())
df = df.loc[today]
and schedule the script using any scheduler such as crontab.
You can create a dictionary of DataFrames and then select each day's DataFrame by key:
dfs = dict(tuple(df.groupby(df.index.strftime('%Y-%m-%d'))))
print(dfs['2018-12-07'])
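If you need to process every day rather than pick one, you can iterate over the same dictionary; a small sketch:

for day, frame in dfs.items():
    print(day, frame.shape)  # each `frame` is the subset for a single day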