read csv files in python - python

I need to read one column from multiple CSV files in a folder and then extract the minimum and maximum dates from that column.
For example, if I have the folder path "/usr/abc/xyz/" and multiple CSV files are present as below
aaa.csv
bbb.csv
ccc.csv
and the files contain the following data
aaa.csv contains
name,address,dates
xxx,11111,20190101
yyy,22222,20190201
zzz,33333,20190101
bbb.csv contains
name,address,dates
fff,11111,20190301
ggg,22222,20190501
hhh,33333,20190601
So I need to extract the minimum and maximum dates from the files; in the above case the date range should be 20190101 to 20190601.
Can anyone please help me extract the minimum and maximum dates from the files in Python?
I need to avoid pandas or any other package, as I need to read the CSV files directly in Python.

import pandas as pd
dt = pd.read_csv('you_csv.csv')
print(max(dt['dates']))
print(min(dt['dates']))
If you need to avoid pandas, you can do the following, although it is not recommended:
dt = []
with open('your_csv.csv', 'r') as f:
    data = f.readlines()
for row in data:
    dt.append(row.split(',')[2].rstrip())
dt.pop(0)  # drop the header value ("dates")
print(max(dt))
print(min(dt))

A solution using only the standard library. It reads each file line by line instead of loading it into memory, so it should have a very low footprint and will work with larger files.
pathlib is used to get all the csv files
datetime is used to convert to dates
sys is used for user input
$ python3 date_min_max.py /usr/abc/xyz/
min date: 2019-01-01 00:00:00
max date: 2019-06-01 00:00:00
date_min_max.py
from pathlib import Path
from datetime import datetime
import sys
if len(sys.argv) > 1:
    p = sys.argv[1]
else:
    p = "."
files = [x for x in Path(p).iterdir() if x.suffix == ".csv"]
date_format = "%Y%m%d"
dt_max = datetime.strptime("19000101", date_format)
dt_min = datetime.strptime("30000101", date_format)
for file in files:
    with file.open("r") as fh:
        for i, line in enumerate(fh):
            if i == 0:
                continue
            t = line.strip().split(",")[2]
            dt_max = max(dt_max, datetime.strptime(t, date_format))
            dt_min = min(dt_min, datetime.strptime(t, date_format))
print("min date: {}\nmax date: {}".format(dt_min, dt_max))
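A variation on the same idea, as a rough sketch: it uses csv.DictReader so the column is looked up by its header name ("dates") instead of by position. The function name date_range and its parameters are made up for illustration, and the demo writes the two sample files from the question into a temporary folder rather than reading /usr/abc/xyz/.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path


def date_range(folder, column="dates", date_format="%Y%m%d"):
    """Return (earliest, latest) over the given column of every *.csv in folder."""
    earliest = latest = None
    for csv_path in sorted(Path(folder).glob("*.csv")):
        with csv_path.open(newline="") as fh:
            # DictReader consumes the header row and yields one dict per data row.
            for row in csv.DictReader(fh):
                value = datetime.strptime(row[column].strip(), date_format)
                if earliest is None or value < earliest:
                    earliest = value
                if latest is None or value > latest:
                    latest = value
    return earliest, latest


# Demo: recreate the sample files from the question in a throwaway folder.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "aaa.csv").write_text(
        "name,address,dates\nxxx,11111,20190101\nyyy,22222,20190201\nzzz,33333,20190101\n"
    )
    Path(folder, "bbb.csv").write_text(
        "name,address,dates\nfff,11111,20190301\nggg,22222,20190501\nhhh,33333,20190601\n"
    )
    earliest, latest = date_range(folder)

print(earliest.strftime("%Y%m%d"), "to", latest.strftime("%Y%m%d"))
# prints: 20190101 to 20190601
```

If no CSV file contains any data rows, both values stay None, so a caller would need to check for that before formatting.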

Related

How to get all files modified in a certain time window?

I want to get all files modified/created in the last 1 hour with Python. I tried this code but it's only getting the last file which was created:
import glob
import os
list_of_files = glob.glob('c://*')
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)
If I created 10 files it shows only last one. How to get all files created in the last 1 hour?
You can do basically this:
get the list of files
get the time for each of them (also check os.path.getmtime() for updates)
use datetime module to get a value to compare against (that 1h)
compare
For that I've used a dictionary to store both paths and timestamps in a compact format. You can then sort the dictionary by its values (the timestamps, which are floats), e.g. with the sorted(...) function, to get the files created within the last hour in order:
import os
import glob
from datetime import datetime, timedelta
hour_files = {
    key: val for key, val in {
        path: os.path.getctime(path)
        for path in glob.glob("./*")
    }.items()
    if datetime.fromtimestamp(val) >= datetime.now() - timedelta(hours=1)
}
Alternatively, without the comprehension:
files = glob.glob("./*")
times = {}
for path in files:
    times[path] = os.path.getctime(path)
hour_files = {}
for key, val in times.items():
    if datetime.fromtimestamp(val) < datetime.now() - timedelta(hours=1):
        continue
    hour_files[key] = val
Or, perhaps your folder is just a mess and you have too many files. In that case, approach it incrementally:
hour_files = {}
for file in glob.glob("./*"):
    timestamp = os.path.getctime(file)
    if datetime.fromtimestamp(timestamp) < datetime.now() - timedelta(hours=1):
        continue
    hour_files[file] = timestamp
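Here is another, shorter approach that uses only the os package. Note that it relies on the Unix ls command and returns only the most recently modified entry, not every file from the last hour: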
import os
directory = "/path/to/directory"
latest_file = os.popen(f"ls -t {directory}").read().split("\n")[0]
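For comparison, here is a small sketch of the "everything from the last hour" filter using pathlib and modification times. It assumes modification time is what matters (on Unix, os.path.getctime reports the last metadata change, not creation). The function name recent_files and its parameters are illustrative, and the demo back-dates one file with os.utime so the result is predictable.

```python
import os
import tempfile
import time
from pathlib import Path


def recent_files(folder, max_age_seconds=3600, now=None):
    """Return regular files in folder modified within max_age_seconds, oldest first."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    candidates = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.stat().st_mtime >= cutoff
    ]
    return sorted(candidates, key=lambda p: p.stat().st_mtime)


# Demo: one fresh file and one whose timestamps are pushed two hours back.
with tempfile.TemporaryDirectory() as folder:
    fresh = Path(folder, "fresh.txt")
    fresh.write_text("new")
    stale = Path(folder, "stale.txt")
    stale.write_text("old")
    two_hours_ago = time.time() - 2 * 3600
    os.utime(stale, (two_hours_ago, two_hours_ago))
    recent_names = [p.name for p in recent_files(folder)]

print(recent_names)  # ['fresh.txt']
```

Comparing raw float timestamps avoids building a datetime object per file, which is the only real difference from the dictionary versions above.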

Convert any Date String Format to a specific date format string

I am making a generic tool which can take any CSV file. I have a CSV file which looks something like this. The first row is the column name and the second row is the type of variable.
sam.csv
Time,M1,M2,M3,CityName
temp,num,num,num,city
20-06-13,19,20,0,aligarh
20-02-13,25,42,7,agra
20-03-13,23,35,4,aligarh
20-03-13,21,32,3,allahabad
20-03-13,17,27,1,aligarh
20-02-13,16,40,5,aligarh
Other CSV file looks like:
Time,M1,M2,M3,CityName
temp,num,num,num,city
20/8/16,789,300,10,new york
12/6/17,464,67,23,delhi
12/6/17,904,98,78,delhi
So, there could be any date format, or it could be a timestamp. I want to convert it to a "%d-%b-%y" format string (e.g. "20-May-13") every time and sort the column from the oldest date to the newest. I have been able to find the column whose type is "temp" and tried to convert it to the required format, but all the methods require me to specify the original format, which is not possible in my case.
Code--
import csv
import time
from datetime import datetime,date
import pandas as pd
import dateutil
from dateutil.parser import parse
filename = 'sam.csv'
data_date = pd.read_csv(filename)
column_name = data_date.ix[:, data_date.loc[0] == "temp"]
column_work = column_name.iloc[1:]
column_some = column_work.iloc[:,0]
default_date = datetime.combine(date.today(), datetime.min.time()).replace(day=1)
for line in column_some:
    print(parse(line[0], default=default_date).strftime("%d-%b-%y"))
In "sam.csv", the dates are in 2013. My output has the correct format, but all six dates come out as 2-Mar-2018.
You can use the dateutil library for converting any date format to your required format.
Ex:
import csv
from dateutil.parser import parse
p = "PATH_TO_YOUR_CSV.csv" #I have used your sample data to test.
with open(p, "r") as infile:
    reader = csv.reader(infile)
    next(reader)  # Skip the column-name row
    next(reader)  # Skip the type row
    for line in reader:
        print(parse(line[0]).strftime("%d-%B-%y"))  # Parse Date and convert it to date-month-year
Output:
20-June-13
20-February-13
20-March-13
20-March-13
20-March-13
20-February-13
20-August-16
06-December-17
06-December-17
MoreInfo on Dateutil
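If dateutil is not available, a standard-library fallback is to try a short list of candidate formats with datetime.strptime. This is only a sketch and not equivalent to dateutil's parser: CANDIDATE_FORMATS, parse_date and normalise_and_sort are made-up names, the format list is an assumption that would need extending for real data, and it reads ambiguous values such as 12/6/17 day-first (so 12 June, whereas the dateutil output above shows 06-December). It also covers the sorting and the "%d-%b-%y" output the question asked for.

```python
from datetime import datetime

# Assumed candidate formats, tried in order; all of them are day-first or ISO.
CANDIDATE_FORMATS = ("%d-%m-%y", "%d/%m/%y", "%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")


def parse_date(text, formats=CANDIDATE_FORMATS):
    """Return a datetime for the first format that matches text."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError("no candidate format matches {!r}".format(text))


def normalise_and_sort(values, out_format="%d-%b-%y"):
    """Parse every value, sort oldest to newest, and re-format as strings."""
    return [d.strftime(out_format) for d in sorted(parse_date(v) for v in values)]


sample = ["20-06-13", "20-02-13", "20-03-13", "20/8/16", "12/6/17"]
normalised = normalise_and_sort(sample)
print(normalised)
```

Month abbreviations from %b depend on the active locale; with the default C locale they are the English ones.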

Open files older than 3 days of date stamp in file name - Python 2.7

** Problem **
I'm trying to open (in python) files older than 3 days of the date stamp which is in the current name. Example: 2016_08_18_23_10_00 - JPN - MLB - Mickeymouse v Burgerface.ply. So far I can create a date variable, however I do not know how to search for this variable in a filename. I presume I need to convert it to a string first?
from datetime import datetime, timedelta
import os
import re
path = "C:\Users\michael.lawton\Desktop\Housekeeper"
## create variable d where current date time is subtracted by 3 days ##
days_to_subtract = 3
d = datetime.today() - timedelta(days=days_to_subtract)
print d
## open file in dir where date in filename = d or older ##
for filename in os.listdir(path):
    if re.match(d, filename):
        with open(os.path.join(path, filename), 'r') as f:
            print line,
Any help will be much appreciated
You can use strptime for this. It will convert your string (assuming it is correctly formatted) into a datetime object which you can use to compare if your file is older than 3 days based on the filename:
from datetime import datetime, timedelta
...
lines = []
for filename in os.listdir(path):
    date_filename = datetime.strptime(filename.split(" ")[0], '%Y_%m_%d_%H_%M_%S')
    if date_filename < datetime.now() - timedelta(days=days_to_subtract):
        with open(os.path.join(path, filename), 'r') as f:
            lines.extend(f.readlines())  # put all lines into array
If the filename is 2016_08_18_23_10_00 - JPN - MLB - Mickeymouse v Burgerface.ply the datetime part will be extracted with filename.split(" ")[0]. Then we can use that to check if it is older than three days using datetime.timedelta
To open all files in the given directory that contain a timestamp in their name older than 3 days:
#!/usr/bin/env python2
import os
import time
DAY = 86400 # POSIX day in seconds
three_days_ago = time.time() - 3 * DAY
for filename in os.listdir(dirpath):
    time_string = filename.partition(" ")[0]
    try:
        timestamp = time.mktime(time.strptime(time_string, '%Y_%m_%d_%H_%M_%S'))
    except Exception:  # can't get timestamp
        continue
    if timestamp < three_days_ago:  # old enough to open
        with open(os.path.join(dirpath, filename)) as file:  # assume it is a file
            for line in file:
                print line,
The code assumes that the timestamps are in the local timezone. It may take DST transitions into account on platforms where C mktime() has access to the tz database (if it doesn't matter whether the file is 72 or 73 hours old in your case then just ignore this paragraph).
Consider using file metadata such as "the last modification time of a file" instead of extracting the timestamp from its name: timestamp = os.path.getmtime(path).
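The filename check can also be separated from the file handling, which makes it easy to try out without touching the disk. The following is a Python 3 sketch of that idea rather than a drop-in replacement for the Python 2.7 code above; older_than, STAMP_FORMAT and the second sample filename are hypothetical, and the current time is passed in explicitly so the result is reproducible.

```python
from datetime import datetime, timedelta

# Timestamp layout taken from the example filename in the question.
STAMP_FORMAT = "%Y_%m_%d_%H_%M_%S"


def older_than(filenames, days=3, now=None):
    """Return the names whose leading timestamp is more than `days` days before now."""
    now = datetime.now() if now is None else now
    cutoff = now - timedelta(days=days)
    selected = []
    for name in filenames:
        stamp_text = name.partition(" ")[0]  # text before the first space
        try:
            stamp = datetime.strptime(stamp_text, STAMP_FORMAT)
        except ValueError:
            continue  # no parseable timestamp in this name, skip it
        if stamp < cutoff:
            selected.append(name)
    return selected


names = [
    "2016_08_18_23_10_00 - JPN - MLB - Mickeymouse v Burgerface.ply",
    "2016_08_21_09_00_00 - hypothetical recent file.ply",
    "notes.txt",
]
old = older_than(names, days=3, now=datetime(2016, 8, 22, 12, 0, 0))
print(old)
# prints: ['2016_08_18_23_10_00 - JPN - MLB - Mickeymouse v Burgerface.ply']
```

In practice the list would come from os.listdir(path), and each selected name would be joined with the directory before opening.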

How to find earliest and latest dates from a CSV File [Python]

My CSV file is arranged so that there's a row named "Dates," and below that row is a gigantic column of a million dates, in the traditional format like "4/22/2015" and "3/27/2014".
How can I write a program that identifies the earliest and latest dates in the CSV file, while maintaining the original format (month/day/year)?
I've tried
for line in count_dates:
    dates = line.strip().split(sep="/")
    all_dates.append(dates)
print (all_dates)
I've tried to take away the "/" and replace it with a blank space, but it does not print anything.
import pandas as pd
import datetime
df = pd.read_csv('file_name.csv')
df['Dates'] = df['Dates'].apply(lambda v: datetime.datetime.strptime(v, '%m/%d/%Y'))
print(df['Dates'].min(), df['Dates'].max())
Considering you have a large file, reading it in its entirety into memory is a bad idea.
Read the file line by line, manually keeping track of the earliest and latest dates. Use datetime.datetime.strptime to convert the strings to dates (it takes the format string as a parameter).
import datetime
with open("input.csv") as f:
    f.readline()  # get the "Dates" header out of the way
    first = f.readline().strip()
    earliest = datetime.datetime.strptime(first, "%m/%d/%Y")
    latest = datetime.datetime.strptime(first, "%m/%d/%Y")
    for line in f:
        date = datetime.datetime.strptime(line.strip(), "%m/%d/%Y")
        if date < earliest: earliest = date
        if date > latest: latest = date
print("Earliest date:", earliest)
print("Latest date:", latest)
Let's open the csv file, read out all the dates. Then use strptime to turn them into comparable datetime objects (now, we can use max). Lastly, let's print out the biggest (latest) date
import csv
from datetime import datetime as dt
with open('path/to/file') as infile:
    reader = csv.reader(infile)
    next(reader)  # skip the "Dates" header row
    latest = max(dt.strptime(row[0], "%m/%d/%Y") for row in reader)
print(dt.strftime(latest, "%m/%d/%Y"))
Naturally, you can use min to get the earliest date. However, this takes two linear runs, and you can do this with just one, if you are willing to do some heavy lifting yourself:
import csv
from datetime import datetime as dt
with open('path/to/file') as infile:
    reader = csv.reader(infile)
    next(reader)  # skip the "Dates" header row
    date, *_rest = next(reader)
    earliest = latest = dt.strptime(date, "%m/%d/%Y")  # seed both with the first date
    for date, *_rest in reader:
        date = dt.strptime(date, "%m/%d/%Y")
        earliest = min(date, earliest)
        latest = max(date, latest)
print("earliest:", dt.strftime(earliest, "%m/%d/%Y"))
print("latest:", dt.strftime(latest, "%m/%d/%Y"))
A bit of an RTFM answer: Open the file in csv format (see the csv library), and then iterate line by line converting the field that is a date into a date object (see the docs for converting a string to a date object), and if it is less than minimum so far store it as minimum, similar for max, with a special condition on the first line that the date becomes both min and max dates.
Or for some overkill you could just use Pandas to read it into a data frame specifying the specific column as date format then just use max & min.
I think it is more convenient to use pandas for this purpose.
import pandas as pd
df = pd.read_csv('file_name.csv')
df['name_of_column_with_date'] = pd.to_datetime(df['name_of_column_with_date'], format='%m/%d/%Y')
print('min_date{}'.format(min(df['name_of_column_with_date'])))
print('max_date{}'.format(max(df['name_of_column_with_date'])))
The built-in functions work well with Pandas Dataframes.
For a better understanding of the format argument of pd.to_datetime, you can use a Python strftime cheat sheet.
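Since the question asks for the result in the original month/day/year spelling, one option is to keep the raw text next to the parsed value and return the text at the end. Here is a single-pass sketch along those lines, using only the standard library; earliest_and_latest is an illustrative name, and the in-memory sample stands in for the real file (which would be opened with open(path, newline="")).

```python
import csv
import io
from datetime import datetime


def earliest_and_latest(lines, column="Dates", date_format="%m/%d/%Y"):
    """Return the earliest and latest dates as they were written in the file."""
    earliest = latest = None  # each becomes a (parsed datetime, original text) pair
    for row in csv.DictReader(lines):
        text = row[column].strip()
        parsed = datetime.strptime(text, date_format)
        if earliest is None or parsed < earliest[0]:
            earliest = (parsed, text)
        if latest is None or parsed > latest[0]:
            latest = (parsed, text)
    if earliest is None:
        raise ValueError("no data rows found")
    return earliest[1], latest[1]


sample = io.StringIO("Dates\n4/22/2015\n3/27/2014\n12/1/2014\n")
first, last = earliest_and_latest(sample)
print(first, last)  # 3/27/2014 4/22/2015
```

Only two dates are held at any time, so memory use stays flat even for a million rows.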

csv file to date format

I would like to read in a csv file of dates (shown below) and loop through it using solar.GetAltitude on each date to calculate a list of sun altitudes. (I'm using Python 2.7.2 on Windows 7 Enterprise.)
CSV file:
TimeStamp
01/01/2014 00:10
01/01/2014 00:20
01/01/2014 00:30
01/01/2014 00:40
My code gives the following error ValueError: unconverted data remains:. This suggests the wrong date format, but it works fine on a single date, rather than a string of dates.
I've researched this topic carefully on Stack Overflow. I've also tried the map function, np.datetime64 and reading to a list rather than a string but get a different error referring to no attribute 'year'.
I'd really appreciate any help because I'm running out of ideas.
import datetime
from datetime import datetime
import julian
import solar
from solar import *
import os
import csv
# Create lists to hold the records.
dates = []
# Navigate to correct directory
os.chdir('D:\\Di_Python')
filename = 'SPA timestamp small.csv'
# Read through the entire file, skip the first line
with open(filename) as f:
    # Create a csv reader object.
    reader = csv.reader(f)
    # Ignore the header row.
    next(reader)
    # Store the dates in the appropriate list.
    for row in reader:
        dates.append(row)
        print row
# Change list to string so can use a function on it
lines = []
for date in dates:
    lines.append('\t'.join(map(str, date)))
result = '\n'.join(lines)
print result
minutes = []
minutes.append(datetime.datetime.strptime(result,'%d/%m/%Y %H:%M'))
# Inputs
latitude_deg = 52.8
longitude_deg = -1.2
elevation = 0
# i should be 52560 - 10 min interval whole year
for i in minutes:
    utc_datetime = i
    altitude = solar.GetAltitude(latitude_deg, longitude_deg, utc_datetime)
    altitude_list.append(altitude)
print altitude_list
First of all, the code is not indented properly making it harder to guess.
I think the input to datetime.datetime.strptime is not correct. You create result by using a '\n'.join(...) but the format string does not contain the '\n'. Creating a string from the list of dates seems unnecessary to me.
I think what you want is this:
for date in dates:
    minutes.append(datetime.datetime.strptime(date, '%d/%m/%Y %H:%M'))
Note that the names you use for the lists are misleading as minutes holds datetime.datetime objects rather than minute values!
Many thanks to Vikramis and Lutz Horn for their help and comments. After experimenting with Vikramis' code, I achieved a working version which I have copied below.
My error occurred at line 40:
minutes.append(datetime.datetime.strptime(result,'%d/%m/%Y %H:%M'))
I found that I needed to create a string from the list to avoid the following error: "TypeError: must be string, not list". I have now tidied this up by using str(date) to replace the for loop, and hopefully used more sensible names.
My problem was with the formatting. It needs to be "['%d/%m/%Y %H:%M']" because I'm accessing items in a list, rather than "'%d/%m/%Y %H:%M'", which works in the shell for a single date.
import datetime
from datetime import datetime
import julian
import solar
from solar import *
import os
import csv
# Create lists to hold the records.
dates = []
datetimeObj = []
altitude_list = []
# Navigate to correct directory
os.chdir('D:\\Di_Python')
filename = 'SPA timestamp small.csv'
# Read through the entire file, skip the first line
with open(filename) as f:
    # Create a csv reader object.
    reader = csv.reader(f)
    # Ignore the header row.
    next(reader)
    # Store the dates in the appropriate list.
    for row in reader:
        dates.append(row)
        print row
# Change format to datetime
# str(date) used to avoid TypeError: must be string, not list
for date in dates:
    datetimeObj.append(datetime.datetime.strptime(str(date),"['%d/%m/%Y %H:%M']"))
for j in datetimeObj:
    print j
# Inputs
latitude_deg = 52.8
longitude_deg = -1.2
elevation = 0
# i should be 52560 - 10 min interval whole year
for i in datetimeObj:
    utc_datetime = i
    altitude = solar.GetAltitude(latitude_deg, longitude_deg, utc_datetime)
    print altitude
    altitude_list.append(altitude)
# print altitude_list
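The TypeError comes from csv.reader yielding each row as a list of fields, so another way around it is to take the first field with row[0] and keep the plain '%d/%m/%Y %H:%M' format, instead of formatting the whole list with str(). Below is a Python 3 sketch of just the reading step, with the solar call left out; read_timestamps is an illustrative name and the in-memory sample stands in for 'SPA timestamp small.csv'.

```python
import csv
import io
from datetime import datetime


def read_timestamps(lines, date_format="%d/%m/%Y %H:%M"):
    """Parse the first column of every non-empty data row into a datetime."""
    reader = csv.reader(lines)
    next(reader)  # skip the "TimeStamp" header row
    # Each row is a list of fields; row[0] is the timestamp string itself.
    return [datetime.strptime(row[0], date_format) for row in reader if row]


sample = io.StringIO("TimeStamp\n01/01/2014 00:10\n01/01/2014 00:20\n")
timestamps = read_timestamps(sample)
for stamp in timestamps:
    print(stamp)
```

Each resulting datetime could then be passed to solar.GetAltitude in the same loop as in the code above.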
