I have a large CSV file that we will be using to import assets into our asset management database. Here is a smaller example for the CSV data.
Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,102,1/5/15 9:30
2LMXK1,201,1/5/15 10:30
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,102,1/7/15 10:00
JEMLP3,201,1/7/15 13:30
JEMLP3,202,1/7/15 15:30
JEMLP3,203,1/7/15 17:30
BR83GP,101,1/5/15 9:00
BR83GP,102,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,202,1/7/15 15:30
BR83GP,301,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,301,1/9/15 15:30
Here are the requirements: “What is the LATEST movement type for each serial number?”
I need to parse the CSV file and for each UNIQUE serial number, take the movement type that has the LATEST “posting date”.
As an example, for Serial Number 2LMXK1 the latest posting date/time is 1/5/15 at 14:00.
Here is basically what I will need to obtain:
“Serial Number 2LMXK1 has a movement type 301 and was last updated 1/5/15 14:00”.
I have started with some code that parses the CSV file and creates a dictionary.
#Import modules
import csv
import pandas as pd
fields = ['Serial number','Movement type','Posting date']
df = pd.read_csv('import.csv', skipinitialspace=True, usecols=fields)
dc = df.to_dict()
#print (df['Serial number'])
for value in dc.items():
    print(value)
This code works to parse the CSV and create a dictionary.
However, I need help with the date comparison and filtering techniques. How may I create another dictionary that only lists unique serial numbers with the latest posting date? Once I have created a new filtered data dictionary I can use that to import into our asset management database. The idea is that I will use python to analyze and manipulate the data before importing into our system.
Pandas is a useful library for more than just reading csv files. In fact, you don't need the csv library at all here (it's not being used in the code sample you posted)
First you need to make sure the dates are read in as dates, by using the parse_dates parameter of the read_csv function. Then you can use pandas' grouping functionality.
# parse the 3rd column (index 2) as dates
df = pd.read_csv('import.csv', skipinitialspace=True, usecols=fields, parse_dates=[2])
last_movement = df.sort_values('Posting date').groupby('Serial number').last()
To create the string that you want, you can then iterate through the rows of last_movement:
for index, row in last_movement.iterrows():
    print('Serial Number {} has a movement type {} and was last updated {}'
          .format(index, row['Movement type'], row['Posting date']))
Which will produce the following:
Serial Number 2LMXK1 has a movement type 301 and was last updated 2015-01-05 14:00:00
Serial Number BR83GP has a movement type 301 and was last updated 2015-01-09 15:30:00
Serial Number JEMLP3 has a movement type 203 and was last updated 2015-01-07 17:30:00
Side note: Pandas should be able to read the column headings for you, so you shouldn't need the usecols parameter
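For the part of the question that asks for a filtered dictionary, here is a minimal sketch built on the same sort-and-group idea. It assumes the dates are month-first (1/5/15 meaning January 5, 2015) and reads a shortened copy of the sample data from a string rather than from import.csv; the names csv_text and latest are only illustrative.

```python
import io
import pandas as pd

# Shortened copy of the sample data from the question
csv_text = """Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,203,1/7/15 17:30
BR83GP,301,1/5/15 13:00
BR83GP,301,1/9/15 15:30
"""

df = pd.read_csv(io.StringIO(csv_text))
# Assumption: month/day/two-digit-year followed by a 24-hour time
df['Posting date'] = pd.to_datetime(df['Posting date'], format='%m/%d/%y %H:%M')

# Latest row per serial number, then one nested dict per serial number
last_movement = df.sort_values('Posting date').groupby('Serial number').last()
latest = last_movement.to_dict(orient='index')

for serial, info in latest.items():
    print(serial, info['Movement type'], info['Posting date'])
```

Each value in latest is itself a dict keyed by column name, which may be convenient when handing the rows to an import routine.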
How you build the dict, and the best way to sort the list, depend a little on what you want. For the parsing side of things, to convert a string into a date object so you can do sane comparisons, you probably want the datetime class in the datetime module (yes, datetime.datetime).
It's got a strptime() function that will do exactly that:
import datetime
datetime.datetime.strptime(r"1/5/15 13:00", "%d/%m/%y %H:%M")
# I've assumed you have a Day/Month/Year format
The only bit of oddness is the format specifier, which is documented here:
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
(note that where it talks about zero-padded, that's for output. It'll parse non-zero padded numbers fine)
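As a rough sketch of how strptime() could drive the "latest per serial number" comparison without pandas: the rows below are a shortened copy of the sample data, and the format string assumes Month/Day/Year (swap %m and %d if your data is Day/Month/Year). The names rows and latest are made up for the example.

```python
from datetime import datetime

# Shortened copy of the sample data: (serial number, movement type, posting date)
rows = [
    ("2LMXK1", 101, "1/5/15 9:00"),
    ("2LMXK1", 301, "1/5/15 14:00"),
    ("JEMLP3", 203, "1/7/15 17:30"),
    ("JEMLP3", 101, "1/6/15 9:00"),
    ("BR83GP", 301, "1/9/15 15:30"),
    ("BR83GP", 202, "1/7/15 15:30"),
]

latest = {}
for serial, movement, posted in rows:
    # Assumption: Month/Day/Year order
    when = datetime.strptime(posted, "%m/%d/%y %H:%M")
    # Keep the row only if it is newer than what we have stored so far
    if serial not in latest or when > latest[serial][1]:
        latest[serial] = (movement, when)

for serial, (movement, when) in latest.items():
    print(serial, movement, when)
```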
I'm using pandas to analyze some data about the House Price Index of all states from quandl:
HPI_Data = quandl.get("FMAC/HPI_AK")
The data looks something like this:
HPI Alaska
Date
1975-01-31 35.105461
1975-02-28 35.465209
1975-03-31 35.843110
and so on.
I've got a second dataframe with some special dates in it:
Date
Name
David 1979-08
Allen 1980-08
Hugo 1989-09
The values for "Date" here are of "string" type and not "date".
I'd like to go 6 months back from each date in the special dataframe and see the values in the HPI dataframe.
I'd like to use .loc, but I have not been able to convert the first dataframe's index from "END OF MONTH" to "MONTH", even after resampling to "1D" and then back to "M".
I would appreciate any help, whether it solves the problem a different way or the janky data-deleting way I want :).
Not sure if I understand correctly. So please clarify your question if this is not correct.
You can convert a string to a pandas date time object using pd.to_datetime and use the format parameter to specify how to parse the string
import pandas as pd
# Creating a dummy Series
sr = pd.Series(['2012-10-21 09:30', '2019-7-18 12:30', '2008-02-2 10:30',
                '2010-4-22 09:25', '2019-11-8 02:22'])
# Convert the underlying data to datetime
sr = pd.to_datetime(sr)
# Subtract 6 months from the datetime series (months=6 shifts by six months;
# the singular month=6 would instead set the month to June)
sr - pd.DateOffset(months=6)
In regards to changing the datetime to just month i.e. 2012-10-21 09:30 --> 2012-10 I would do this:
sr.dt.to_period('M')
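Putting the two pieces together for the original question, one possible sketch is to turn the end-of-month index into monthly periods and look up the period six months before each special date. The three HPI rows are the ones shown in the question; the special dates and the names "Storm A" and "Storm B" are hypothetical so that they fall inside that tiny sample.

```python
import pandas as pd

# The three HPI rows from the question, indexed by end-of-month dates
hpi = pd.DataFrame(
    {"HPI Alaska": [35.105461, 35.465209, 35.843110]},
    index=pd.to_datetime(["1975-01-31", "1975-02-28", "1975-03-31"]),
)
# "END OF MONTH" -> "MONTH": a monthly PeriodIndex
hpi.index = hpi.index.to_period("M")

# Hypothetical special dates stored as 'YYYY-MM' strings
special = pd.DataFrame(
    {"Date": ["1975-08", "1975-09"]},
    index=pd.Index(["Storm A", "Storm B"], name="Name"),
)

# Subtracting an integer from a monthly PeriodIndex moves back that many months
six_back = pd.PeriodIndex(special["Date"].tolist(), freq="M") - 6

# reindex gives NaN instead of raising if a month is missing from the HPI data
looked_up = hpi.reindex(six_back)
looked_up.index = special.index
print(looked_up)
```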
I am a python beginner and am trying to read a csv file with pandas. The issue is that the date column in the csv has following format: 2020-03-12 00:00:00+00:00. Within the read_csv function already, I want to change the date format into isoformat (%Y-%m-%d). I tried all stackoverflow solutions but none of them work. This is my code:
import time
from datetime import date
import pandas as pd

url = 'https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data'
countries = pd.read_csv(url,
                        usecols=[2, 5, 8],
                        index_col=['Landkreis', 'Meldedatum'],
                        parse_dates=['Meldedatum'],
                        squeeze=True
                        ).sort_index()
The column "Meldedatum" should only show the date, not the hours and minutes. Yet, I can't change the format because it is an index column.
Your help is much appreciated!
Read your csv normally into dataframe without specifying any format.
Then do this:
countries['Meldedatum'] = pd.to_datetime(countries['Meldedatum'])
This should give you the format you want.
That's just how pandas displays a datetime object. It always stores fields for hours/minutes/seconds/milliseconds, even if they are all set to zero. You can't change this internal representation.
You can, however, cast datetime objects to string, in order to format their representation the way you want. Keep in mind that you lose all functionality of a datetime object along the way.
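A small sketch of that string conversion, assuming the column holds values like 2020-03-12 00:00:00+00:00 as described in the question. The frame below is a hand-made stand-in for the downloaded data, and only the column names are taken from the question.

```python
import pandas as pd

# Hand-made stand-in for the downloaded data
df = pd.DataFrame({
    "Landkreis": ["LK Ahrweiler", "LK Ahrweiler"],
    "Meldedatum": ["2020-03-12 00:00:00+00:00", "2020-03-13 00:00:00+00:00"],
    "AnzahlFall": [5, 2],
})

# Parse to timezone-aware datetimes, then keep only the date part as text
df["Meldedatum"] = pd.to_datetime(df["Meldedatum"], utc=True).dt.strftime("%Y-%m-%d")

# The formatted strings can now be used as an index level
indexed = df.set_index(["Landkreis", "Meldedatum"]).sort_index()
print(indexed)
```

After this step the Meldedatum level holds plain strings, so datetime-specific operations such as resampling are no longer available on it.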
It looks like you want to count the number of occurrences per day. If that's the case, you should use a groupby object. We don't need to set the index columns or parse dates in this case. We can also convert the representation of the datetime objects to strings, if that's your preference:
import time
from datetime import date
import pandas as pd
# get the data
url = 'https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data'
countries = pd.read_csv(url, usecols=[2, 5, 8], index_col=None, squeeze=True).sort_index()
# modify dates to strings
countries['Meldedatum'] = countries.Meldedatum.astype(str).apply(lambda x: x.split('T')[0])
# group by Landkreis and Meldedatum
grouped_countries = countries.groupby(['Landkreis', 'Meldedatum']).count()
print(grouped_countries)
# output:
                               AnzahlFall
Landkreis          Meldedatum
LK Ahrweiler       2020-03-12           5
                   2020-03-13           2
                   2020-03-14           1
                   2020-03-16           3
                   2020-03-17           5
...                                   ...
StadtRegion Aachen 2020-04-14           8
                   2020-04-15          37
                   2020-04-16          23
                   2020-04-17          18
                   2020-04-18           5
Hello everyone, I have a CSV file which contains a month's worth of data in hourly intervals. I need to get an average value of one of the columns for the time interval of 12:00am-3:00am for the entire month. I am using pandas.DataFrame to try to do this.
Sample of data I am using
DateTime current voltage
11/1/2014 12:00 1.122061402 4.058617834
11/1/2014 1:00 1.120534925 4.060912132
11/1/2014 2:00 1.119349897 4.058656072
11/1/2014 3:00 1.118277733 4.060912132
11/1/2014 4:00 1.120365636 4.060912132
11/1/2014 5:00 1.120365636 4.060912132
I'd like to average column 2 from 12am-3am every day for the entire month. I think a conditional statement on the time would be a good option; however, I am unsure how to implement that condition on date/time data.
I will assume that you have already imported the file into a Pandas dataframe named df.
Confirm that your "DateTime" field is being recognized by pandas as a DateTime by checking the value of df.dtypes. If not, recast e.g. with:
df['DateTime'] = pd.to_datetime(df['DateTime'])
Double-check that times like 12 AM, 1 PM, etc. are being handled properly. (You have not indicated anything to distinguish 12 AM from 12 PM etc. in your dataset.) If not, you will need to devise an appropriate method to correct them or re-export them from the original source.
Create a DatetimeIndex from your DateTime field:
df = df.set_index(pd.DatetimeIndex(df['DateTime']))
Now take Dmitry's suggestion (lightly modified):
>>> df.between_time('0:00', '3:00').resample('1D').mean()
The index of the result will show the beginning of the time interval being averaged.
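To see the steps end to end, here is a self-contained sketch on synthetic data: two days of hourly readings whose "current" value is simply the hour counter (0 to 47), so the expected averages are easy to verify by hand. The data is made up; only the column name comes from the question.

```python
import pandas as pd

# Two days of synthetic hourly readings starting at midnight on 2014-11-01
index = pd.date_range("2014-11-01", periods=48, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"current": [float(i) for i in range(48)]}, index=index)

# Keep rows between 00:00 and 03:00 (both ends included), then average per day
nightly = df.between_time("0:00", "3:00").resample("1D").mean()
print(nightly)
```

Day one averages the values 0, 1, 2, 3 and day two averages 24, 25, 26, 27, so the result has one row per day with 1.5 and 25.5.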
Edited to take into account new info in the comments.
I have a large CSV file that we will be using to import assets into our asset management database. Here is a smaller example for the CSV data.
Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,102,1/5/15 9:30
2LMXK1,201,1/5/15 10:30
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,102,1/7/15 10:00
JEMLP3,201,1/7/15 13:30
JEMLP3,202,1/7/15 15:30
JEMLP3,203,1/7/15 17:30
BR83GP,101,1/5/15 9:00
BR83GP,102,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,202,1/7/15 15:30
BR83GP,301,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,301,1/9/15 15:30
I need to parse this CSV and return a JSON object of only the latest movement type for each unique serial number.
I have a script that successfully does the following:
* Parse the CSV
* Sort by date and group by serial number, keeping the latest date
* Convert the pandas DataFrame to a JSON object (but the serial number is missing)
ISSUE:
The "serial number" column is omitted when converting the dataframe to a JSON object. I think the problem is the fact that "serial number" is used as the groupby value. I tried the builtin to_json but that did not return the data in the correct format.
The data frame contains the correct filtered data that I need as we can see in print(last_movement).
How can I create a JSON object and obtain all of the columns in the dataframe?
#Import python modules
import ujson as json
import pandas as pd
import numpy as np
#CSV parse to panda dataframe
pdata = pd.read_csv('import.csv', skipinitialspace=True, parse_dates=[2])
#Sort by posting date to get rows by latest posting date/time only
last_movement = pdata.sort_values('Posting date').groupby('Serial number').last()
print(last_movement)
# RETURNS
# We know the dataframe is correct
#
# Movement type Posting date
#Serial number
#2LMXK1 301 2015-01-05 14:00:00
#BR83GP 301 2015-01-09 15:30:00
#JEMLP3 203 2015-01-07 17:30:00
out = last_movement.to_json()
print(out)
#RETURNS a JSON object that is aggregated by serial number
# {"Movement type":{"2LMXK1":301,"BR83GP":301,"JEMLP3":203},"Posting date":{"2LMXK1":1420466400000,"BR83GP":1420817400000,"JEMLP3":1420651800000}}
Here is the output when I tried a custom function to iterate the values and convert the dataframe to JSON object. Although this is a little bit better, it still does not have the serial number. It appears as though the "groupby" aggregation is causing some issues with the serial number column. Perhaps I need to somehow "ungroup" the resulting dataframe so I have my filtered data and can convert it to a JSON object.
#Convert panda dataframe to json object
def tojson(df):
    d = [
        dict([
            (colname, row[i])
            for i, colname in enumerate(df.columns)
        ])
        for row in df.values
    ]
    return json.dumps(d)
out = tojson(last_movement)
print(out)
# RETURNS
# MISSING SERIAL NUMBER
# [{"Posting date":1420466400,"Movement type":301},{"Posting date":1420817400,"Movement type":301},{"Posting date":1420651800,"Movement type":203}]
I have located the answer. Set as_index=False in the groupby param. The JSON object is in the correct format and includes the serial number with this change.
Converting a Pandas GroupBy object to DataFrame
Aggregation functions will not return the groups that you are
aggregating over if they are named columns, when as_index=True, the
default. The grouped columns will be the indices of the returned
object.
Passing as_index=False will return the groups that you are aggregating
over, if they are named columns.
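A minimal sketch of that fix on a shortened copy of the sample data. The choice of orient='records' and date_format='iso' for to_json is my assumption about what "the correct format" means here; adjust to taste.

```python
import io
import json
import pandas as pd

# Shortened copy of the sample data from the question
csv_text = """Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,301,1/5/15 14:00
JEMLP3,203,1/7/15 17:30
BR83GP,202,1/7/15 15:30
BR83GP,301,1/9/15 15:30
"""

pdata = pd.read_csv(io.StringIO(csv_text))
# Assumption: month-first dates
pdata['Posting date'] = pd.to_datetime(pdata['Posting date'], format='%m/%d/%y %H:%M')

# as_index=False keeps 'Serial number' as an ordinary column
last_movement = (pdata.sort_values('Posting date')
                      .groupby('Serial number', as_index=False)
                      .last())

# One JSON object per row, with ISO-formatted dates
out = last_movement.to_json(orient='records', date_format='iso')
records = json.loads(out)
print(records)
```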
I'm reading a huge CSV with a date field in the format YYYYMMDD and I'm using the following lambda to convert it when reading:
import pandas as pd
df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=lambda t: pd.to_datetime(str(t),
                                                      format='%Y%m%d', coerce=True))
This function is very slow though.
Any suggestion to improve it?
Note: As @ritchie46's answer states, this solution may be redundant since pandas version 0.25, thanks to the new argument cache_dates, which defaults to True.
Try using this function for parsing dates:
def lookup(date_pd_series, format=None):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date, format=format) for date in date_pd_series.unique()}
    return date_pd_series.map(dates)
Use it like:
df['date-column'] = lookup(df['date-column'], format='%Y%m%d')
Benchmarks:
$ python date-parse.py
to_datetime: 5799 ms
dateutil: 5162 ms
strptime: 1651 ms
manual: 242 ms
lookup: 32 ms
Source: https://github.com/sanand0/benchmarks/tree/master/date-parse
Great suggestion @EdChum! As @EdChum suggests, using infer_datetime_format=True can be significantly faster. Below is my example.
I have a file of temperature data from a sensor log, which looks like this:
RecNum,Date,LocationID,Unused
1,11/7/2013 20:53:01,13.60,"117","1",
2,11/7/2013 21:08:01,13.60,"117","1",
3,11/7/2013 21:23:01,13.60,"117","1",
4,11/7/2013 21:38:01,13.60,"117","1",
...
My code reads the csv and parses the date (parse_dates=['Date']).
With infer_datetime_format=False, it takes 8min 8sec:
Tue Jan 24 12:18:27 2017 - Loading the Temperature data file.
Tue Jan 24 12:18:27 2017 - Temperature file is 88.172 MB.
Tue Jan 24 12:18:27 2017 - Loading into memory. Please be patient.
Tue Jan 24 12:26:35 2017 - Success: loaded 2,169,903 records.
With infer_datetime_format=True, it takes 13sec:
Tue Jan 24 13:19:58 2017 - Loading the Temperature data file.
Tue Jan 24 13:19:58 2017 - Temperature file is 88.172 MB.
Tue Jan 24 13:19:58 2017 - Loading into memory. Please be patient.
Tue Jan 24 13:20:11 2017 - Success: loaded 2,169,903 records.
Unless you're stuck with a very old version of pandas, pre 0.25, this answer is not for you.
The functionality described here has been merged into pandas in version 0.25
Streamlined date parsing with caching
Reading all the data and then converting it will always be slower than converting while reading the CSV, since you won't need to iterate over all the data twice if you do it right away. You also don't have to store it as strings in memory.
We can define our own date parser that utilizes a cache for the dates it has already seen.
import pandas as pd
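cache = {}

def cached_date_parser(s):
    # Return the already-parsed value if this string has been seen before
    if s in cache:
        return cache[s]
    # errors='coerce' replaces the old coerce=True keyword, which newer pandas rejects
    dt = pd.to_datetime(s, format='%Y%m%d', errors='coerce')
    cache[s] = dt
    return dt

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=cached_date_parser)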
This has the same advantage as @fixxxer's answer of parsing each string only once, with the added bonus of not having to read all the data and THEN parse it, saving you memory and processing time.
Since pandas version 0.25 the function pandas.read_csv accepts a cache_dates=boolean (which defaults to True) keyword argument. So no need to write your own function for caching as done in the accepted answer.
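If you would rather not pass a parser function at all, one alternative (not what the question's code does) is to read the column as text and convert it in a single vectorized call with an explicit format. The sketch below uses a tiny in-memory sample with one deliberately invalid value; the sample data is made up.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the huge CSV; the last row has an invalid date
raw = "20120608,12321\n20130608,12321\nnot-a-date,12321\n"

# Read the date column as plain strings first
df = pd.read_csv(io.StringIO(raw), header=None, dtype={0: str})

# One vectorized conversion; invalid values become NaT instead of raising
df[0] = pd.to_datetime(df[0], format="%Y%m%d", errors="coerce")
print(df)
```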
No need to specify a date_parser, pandas is able to parse this without any trouble, plus it will be much faster:
In [21]:
import io
import pandas as pd
t="""date,val
20120608,12321
20130608,12321
20140308,12321"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
date 3 non-null datetime64[ns]
val 3 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 72.0 bytes
In [22]:
df
Out[22]:
date val
0 2012-06-08 12321
1 2013-06-08 12321
2 2014-03-08 12321
Try the standard library:
import datetime
parser = lambda t: datetime.datetime.strptime(str(t), "%Y%m%d")
However, I don't really know if this is much faster than pandas.
Since your format is so simple, what about
def parse(t):
    string_ = str(t)
    # slice YYYY, MM, DD out of the 8-character string
    return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
EDIT you say you need to take care of invalid data.
def parse(t):
    string_ = str(t)
    try:
        return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
    except ValueError:  # raised for non-numeric text and for impossible dates
        return default_datetime  # you should define that somewhere else
All in all, I'm a bit conflicted about the validity of your problem:
* you need to be fast, but still you get your data from a CSV
* you need to be fast, but still need to deal with invalid data
That's kind of contradictory. My personal approach here would be to assume that your "huge" CSV just needs to be brought into a better-performing format once. Then either you shouldn't care about the speed of that conversion (because it only happens once), or you should get whatever produces the CSV to give you better data. There are many formats that don't rely on string parsing.
If your datetime has a UTC timestamp and you just need part of it, convert it to a string, slice out what you need, and then apply the code below for much faster access.
created_at
2018-01-31 15:15:08 UTC
2018-01-31 15:16:02 UTC
2018-01-31 15:27:10 UTC
2018-02-01 07:05:55 UTC
2018-02-01 08:50:14 UTC
df["date"]= df["created_at"].apply(lambda x: str(x)[:10])
df["date"] = pd.to_datetime(df["date"])
I have a CSV with ~150k rows. After trying almost all the suggestions in this post, I found it 25% faster to:
* read the file row by row using Python 3.7's native csv.reader
* convert all 4 numeric columns using float()
* parse the date column with datetime.datetime.fromisoformat()
and, behold:
* finally convert the list to a DataFrame (!)
It baffles me how this can be faster than native pandas pd.read_csv(...). Can someone explain?
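For reference, the row-by-row approach described above might look roughly like this. The two sample rows and the column names (date, a, b, c, d) are invented for illustration, and the sketch makes no claim about speed.

```python
import csv
import io
from datetime import datetime

import pandas as pd

# Invented sample: an ISO-formatted date followed by 4 numeric columns
raw = (
    "2020-01-01T00:00:00,1.0,2.0,3.0,4.0\n"
    "2020-01-02T12:30:00,5.5,6.5,7.5,8.5\n"
)

rows = []
for date_text, *numbers in csv.reader(io.StringIO(raw)):
    # Parse the date with the standard library and the numbers with float()
    rows.append([datetime.fromisoformat(date_text), *map(float, numbers)])

# Only at the end turn the list of rows into a DataFrame
df = pd.DataFrame(rows, columns=["date", "a", "b", "c", "d"])
print(df.dtypes)
```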