Python pandas DataFrame to JSON object omitting a groupby column - python

I have a large CSV file that we will be using to import assets into our asset management database. Here is a smaller example for the CSV data.
Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,102,1/5/15 9:30
2LMXK1,201,1/5/15 10:30
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,102,1/7/15 10:00
JEMLP3,201,1/7/15 13:30
JEMLP3,202,1/7/15 15:30
JEMLP3,203,1/7/15 17:30
BR83GP,101,1/5/15 9:00
BR83GP,102,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,202,1/7/15 15:30
BR83GP,301,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,301,1/9/15 15:30
I need to parse this CSV and return a JSON object of only the latest movement type for each unique serial number.
I have a script that successfully does the following:
* Parse the CSV
* Sort by date, group by serial number, and get the latest date
* Convert the pandas dataframe to a JSON object (but the serial number is missing)
ISSUE:
The "serial number" column is omitted when converting the dataframe to a JSON object. I think the problem is the fact that "serial number" is used as the groupby value. I tried the builtin to_json but that did not return the data in the correct format.
The data frame contains the correct filtered data that I need as we can see in print(last_movement).
How can I create a JSON object and obtain all of the columns in the dataframe?
#Import python modules
import ujson as json
import pandas as pd
import numpy as np
#CSV parse to panda dataframe
pdata = pd.read_csv('import.csv', skipinitialspace=True, parse_dates=[2])
#Sort by posting date to get rows by latest posting date/time only
last_movement = pdata.sort_values('Posting date').groupby('Serial number').last()
print(last_movement)
# RETURNS
# We know the dataframe is correct
#
#                Movement type        Posting date
# Serial number
# 2LMXK1                   301 2015-01-05 14:00:00
# BR83GP                   301 2015-01-09 15:30:00
# JEMLP3                   203 2015-01-07 17:30:00
out = last_movement.to_json()
print(out)
#RETURNS a JSON object that is aggregated by serial number
# {"Movement type":{"2LMXK1":301,"BR83GP":301,"JEMLP3":203},"Posting date":{"2LMXK1":1420466400000,"BR83GP":1420817400000,"JEMLP3":1420651800000}}
Here is the output when I tried a custom function to iterate the values and convert the dataframe to JSON object. Although this is a little bit better, it still does not have the serial number. It appears as though the "groupby" aggregation is causing some issues with the serial number column. Perhaps I need to somehow "ungroup" the resulting dataframe so I have my filtered data and can convert it to a JSON object.
#Convert pandas dataframe to json object
def tojson(df):
    d = [
        dict([
            (colname, row[i])
            for i, colname in enumerate(df.columns)
        ])
        for row in df.values
    ]
    return json.dumps(d)
out = tojson(last_movement)
print(out)
# RETURNS
# MISSING SERIAL NUMBER
# [{"Posting date":1420466400,"Movement type":301},{"Posting date":1420817400,"Movement type":301},{"Posting date":1420651800,"Movement type":203}]

I found the answer: pass as_index=False to groupby. With this change the JSON object is in the correct format and includes the serial number.
This is explained in "Converting a Pandas GroupBy object to DataFrame":
Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.
Passing as_index=False will return the groups that you are aggregating over, if they are named columns.
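A minimal sketch of that fix, assuming a few rows of the sample CSV are inlined as a string (the real script reads import.csv) and that the dates are month/day/year. The to_json arguments orient="records" and date_format="iso" are my own choice for readable output, not something the original script used.

```python
import io
import json

import pandas as pd

# A subset of the sample data from the question, inlined for the sketch.
CSV_TEXT = """Serial number,Movement type,Posting date
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,202,1/7/15 15:30
JEMLP3,203,1/7/15 17:30
BR83GP,202,1/7/15 15:30
BR83GP,301,1/9/15 15:30
"""

pdata = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)
# Assumption: dates are month/day/year.
pdata["Posting date"] = pd.to_datetime(pdata["Posting date"], format="%m/%d/%y %H:%M")

# as_index=False keeps "Serial number" as a regular column instead of the index.
last_movement = (
    pdata.sort_values("Posting date")
    .groupby("Serial number", as_index=False)
    .last()
)

# One JSON record per row, with every column included.
out = last_movement.to_json(orient="records", date_format="iso")
records = json.loads(out)
print(records)
```

Calling .reset_index() on the result of the original groupby would be another way to turn the index back into a column.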

Related

How to change pandas' Datetime Index from "End of month" To just "Month"

I'm using pandas to analyze some data about the House Price Index of all states from quandl:
HPI_Data = quandl.get("FMAC/HPI_AK")
The data looks something like this:
HPI Alaska
Date
1975-01-31 35.105461
1975-02-28 35.465209
1975-03-31 35.843110
and so on.
I've got a second dataframe with some special dates in it:
Date
Name
David 1979-08
Allen 1980-08
Hugo 1989-09
The values for "Date" here are of "string" type and not "date".
I'd like to go 6 months back from each date in the special dataframe and see the values in the HPI dataframe.
I'd like to use .loc, but I have not been able to convert the first dataframe's index from "END OF MONTH" to "MONTH", even after resampling to "1D" and then back to "M".
I would appreciate any help, whether it solves the problem a different way or the janky data-deleting way I want :).
Not sure if I understand correctly. So please clarify your question if this is not correct.
You can convert a string to a pandas date time object using pd.to_datetime and use the format parameter to specify how to parse the string
import pandas as pd
# Creating a dummy Series
sr = pd.Series(['2012-10-21 09:30', '2019-7-18 12:30', '2008-02-2 10:30',
                '2010-4-22 09:25', '2019-11-8 02:22'])
# Convert the underlying data to datetime
sr = pd.to_datetime(sr)
# Subtract 6 months from each datetime in the series
# (months=6 shifts by six months; month=6 would instead set the month to June)
sr - pd.DateOffset(months=6)
To reduce a datetime to just its month, i.e. 2012-10-21 09:30 --> 2012-10, I would do this:
sr.dt.to_period('M')
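As a rough sketch of how these pieces could fit the original question: converting the month-end index to monthly periods lets .loc match year-month labels directly, and subtracting an integer from a monthly Period steps back that many months. The frame below copies three rows from the question; the special date 1975-09 and the name "Example" are made up so that the lookup lands inside this tiny sample.

```python
import pandas as pd

# Three rows copied from the question, indexed by month-end dates.
hpi = pd.DataFrame(
    {"HPI Alaska": [35.105461, 35.465209, 35.843110]},
    index=pd.to_datetime(["1975-01-31", "1975-02-28", "1975-03-31"]),
)
# Replace the month-end timestamps with monthly periods (1975-01, 1975-02, ...).
hpi.index = hpi.index.to_period("M")

# Hypothetical special date, stored as a "YYYY-MM" string like in the question.
special = pd.DataFrame({"Date": ["1975-09"]}, index=["Example"])

# Parse the string as a monthly period and step back six months.
target = pd.Period(special.loc["Example", "Date"], freq="M") - 6
value = hpi.loc[target, "HPI Alaska"]
print(target, value)

# The same shift on a full timestamp uses the plural "months" keyword.
shifted = pd.Timestamp("2012-10-21 09:30") - pd.DateOffset(months=6)
print(shifted)
```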

Python Pandas read_csv function does not allow to change parsed dates into required format

I am a python beginner and am trying to read a csv file with pandas. The issue is that the date column in the csv has following format: 2020-03-12 00:00:00+00:00. Within the read_csv function already, I want to change the date format into isoformat (%Y-%m-%d). I tried all stackoverflow solutions but none of them work. This is my code:
import time
from datetime import date
url = 'https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data'
countries = pd.read_csv(url,
                        usecols=[2, 5, 8],
                        index_col=['Landkreis', 'Meldedatum'],
                        parse_dates=['Meldedatum'],
                        squeeze=True
                        ).sort_index()
Current result
The column "Meldedatum" should only show the date, not the hours and minutes. Yet, I can't change the format because it is an index column.
Your help is much appreciated!
Read your csv normally into dataframe without specifying any format.
Then do this:
countries['Meldedatum'] = pd.to_datetime(countries['Meldedatum'])
This should give you the format you want.
That's just how pandas displays a datetime object. It always stores fields for hours/minutes/seconds/milliseconds, even if they are all set to zero. You can't change this internal representation.
You can, however, cast datetime objects to string, in order to format their representation the way you want. Keep in mind that you lose all functionality of a datetime object along the way.
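A small sketch of that trade-off, using two made-up values in the same shape as the question's column (2020-03-12 00:00:00+00:00). The strftime format string is the %Y-%m-%d pattern the question asks for.

```python
import pandas as pd

# Two sample values shaped like the "Meldedatum" column in the question.
stamps = pd.Series(
    pd.to_datetime(["2020-03-12 00:00:00+00:00", "2020-03-13 00:00:00+00:00"])
)

# Formatting produces plain strings; the datetime accessors are gone afterwards.
as_text = stamps.dt.strftime("%Y-%m-%d")
print(as_text.tolist())

# The original series still supports datetime operations such as .dt.day.
print(stamps.dt.day.tolist())
```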
It looks like you want to count the number of occurrences per day. If that's the case, you should use a groupby object. We don't need to set the index columns or parse dates in this case. We can also convert the representation of the datetime objects to strings, if that's your preference:
import time
from datetime import date
import pandas as pd
# get the data
url = 'https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data'
countries = pd.read_csv(url, usecols=[2, 5, 8], index_col=None).sort_index()
# modify dates to strings
countries['Meldedatum'] = countries.Meldedatum.astype(str).apply(lambda x: x.split('T')[0])
# group by Landkreis and Meldedatum
grouped_countries = countries.groupby(['Landkreis', 'Meldedatum']).count()
print(grouped_countries)
# output:
#                                AnzahlFall
# Landkreis          Meldedatum
# LK Ahrweiler       2020-03-12           5
#                    2020-03-13           2
#                    2020-03-14           1
#                    2020-03-16           3
#                    2020-03-17           5
# ...                                   ...
# StadtRegion Aachen 2020-04-14           8
#                    2020-04-15          37
#                    2020-04-16          23
#                    2020-04-17          18
#                    2020-04-18           5

"ValueError: year is out of range" with IEX Cloud API

I have a CSV (which I converted to a dataframe) consisting of company/stock data:
  Symbol  Quantity  Price  Cost      date
0    DIS         9    NaN    20  20180531
1   SBUX         5    NaN    30  20180228
2   PLOW         4    NaN    40  20180731
3   SBUX         2    NaN    50  20191130
4    DIS        11    NaN    25  20171031
And I am trying to use the IEX Cloud API to pull in the stock Price for a given date. And then ultimately write that to the dataframe. Per the IEX Cloud API documentation, I can use the get_historical_data function, where the 2nd argument is the date: df = get_historical_data("SBUX", "20190617", close_only=True)
Everything works fine so long as I pass in a raw date directly to the function (e.g., 20190617), but if I try using a variable instead, I get ValueError: year 20180531 is out of range. I'm guessing something is wrong with the date format in my original CSV?
Here is my full code:
import os
from iexfinance.stocks import get_historical_data
import pandas as pd
os.environ['IEX_API_VERSION'] = 'iexcloud-sandbox'
os.environ['IEX_TOKEN'] = 'Tsk_5798c0ab124d49639bb1575b322841c4'
input_df = pd.read_csv("all.csv")
for index, row in input_df.iterrows():
    symbol = row['Symbol']
    date = row['date']
    temp_df = get_historical_data(symbol, date, close_only=True, output_format='pandas')
    price = temp_df['close'].values[0]
    print(temp_df)
Note that this is a public token, so it's okay to use.
When you called get_historical_data("SBUX", "20190617", close_only=True), you passed the date as a string. But when you read a DataFrame using read_csv, this column (containing 8-digit strings) is converted to an integer. This difference can be the source of the problem.
Try one of two things:
* convert this column to string, or
* while reading the DataFrame, pass dtype={'date': str}, so that this column will be read as a string.
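The difference between the two reads can be checked with a short sketch. It assumes two rows of the question's data inlined as comma-separated text; the real script reads all.csv.

```python
import io

import pandas as pd

# Two rows from the question, inlined as CSV text for the sketch.
CSV_TEXT = """Symbol,Quantity,Price,Cost,date
DIS,9,,20,20180531
SBUX,5,,30,20180228
"""

# Default read: the 8-digit dates are inferred as integers.
default_df = pd.read_csv(io.StringIO(CSV_TEXT))
print(default_df["date"].dtype)

# Option 1: convert the column to strings after reading.
converted = default_df["date"].astype(str)

# Option 2: ask read_csv to keep the column as strings from the start.
string_df = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"date": str})
print(type(string_df["date"].iloc[0]))
```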
You should be fine if you transform your date column into datetime.
import pandas as pd
df = pd.DataFrame(['20180531'])
pd.to_datetime(df.values[:, 0])
Out[43]: DatetimeIndex(['2018-05-31'], dtype='datetime64[ns]', freq=None)
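Then your column will be correctly formatted for use elsewhere. You can insert the line below after pd.read_csv(). Because read_csv loads this column as integers, pass an explicit format so the values are read as year-month-day rather than as numeric timestamps:
input_df['date'] = pd.to_datetime(input_df['date'], format='%Y%m%d')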
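A minimal sketch of the conversion on an integer column, which is what read_csv produces for this file. The two-row frame is a stand-in for the question's input_df, and the explicit format="%Y%m%d" is my addition so that the integers are interpreted as calendar dates. The call to get_historical_data is left out because it needs network access.

```python
import pandas as pd

# Stand-in for the frame read from all.csv; dates arrive as integers.
input_df = pd.DataFrame(
    {"Symbol": ["DIS", "SBUX"], "date": [20180531, 20180228]}
)

# Parse the 8-digit integers as year-month-day dates.
input_df["date"] = pd.to_datetime(input_df["date"], format="%Y%m%d")
print(input_df)

# If a string is needed again later, format the dates back explicitly.
as_strings = input_df["date"].dt.strftime("%Y%m%d")
print(as_strings.tolist())
```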

Importing excel data with pandas showing date-time despite being date value

I've just started using pandas and I'm trying to import an excel file but I get Date-Time values like 01/01/2019 00:00:00 instead of the 01/01/2019 format. The source data is Date by the way, not Date-Time.
I'm using the following code
import pandas as pd
df = pd.read_excel (r'C:\Users\abcd\Desktop\KumulData.xlsx')
print(df)
The columns that have date in them are "BDATE", "BVADE" and "AKTIVASYONTARIH" which correspond to 6th, 7th and 11th columns.
What code can I use to see the dates as Date format in Pandas Dataframe?
Thanks.
If they're already datetimes then you can extract the date part and reassign the columns:
df[["BDATE", "BVADE", "AKTIVASYONTARIH"]] = df[["BDATE", "BVADE", "AKTIVASYONTARIH"]].apply(lambda x: x.dt.date)
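For the sake of completeness, the same goal can also be approached by casting to a day-resolution dtype. This returns a new frame, so assign the result back if you want to keep it, and note that recent pandas releases may reject this dtype: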
df[["BDATE", "BVADE", "AKTIVASYONTARIH"]].astype("datetime64[D]")
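The date-extraction approach can be tried without the Excel file. The sketch below builds a small in-memory frame with two of the column names from the question; the values and the AMOUNT column are invented for illustration, and the final strftime call is my own addition to show the 01/01/2019 layout.

```python
import pandas as pd

# Made-up stand-in for the frame returned by pd.read_excel.
df = pd.DataFrame(
    {
        "BDATE": pd.to_datetime(["2019-01-01", "2019-02-15"]),
        "BVADE": pd.to_datetime(["2019-03-01", "2019-04-15"]),
        "AMOUNT": [10, 20],  # hypothetical non-date column, left untouched
    }
)

date_cols = ["BDATE", "BVADE"]
# Replace each datetime column with plain datetime.date objects.
df[date_cols] = df[date_cols].apply(lambda col: col.dt.date)
print(df)

# Optional: render the dates as day/month/year text for display.
formatted = df["BDATE"].map(lambda d: d.strftime("%d/%m/%Y"))
print(formatted.tolist())
```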

Python CSV data analysis based on date time

I have a large CSV file that we will be using to import assets into our asset management database. Here is a smaller example for the CSV data.
Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,102,1/5/15 9:30
2LMXK1,201,1/5/15 10:30
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,102,1/7/15 10:00
JEMLP3,201,1/7/15 13:30
JEMLP3,202,1/7/15 15:30
JEMLP3,203,1/7/15 17:30
BR83GP,101,1/5/15 9:00
BR83GP,102,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,202,1/7/15 15:30
BR83GP,301,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,301,1/9/15 15:30
Here are the requirements: “What is the LATEST movement type for each serial number?”
I need to parse the CSV file and for each UNIQUE serial number, take the movement type that has the LATEST “posting date”.
As an example, for Serial Number 2LMXK1 the latest posting date/time is 1/5/15 at 14:00.
Here is basically what I will need to obtain:
“Serial Number 2LMXK1 has a movement type 301 and was last updated 1/5/15 14:00”.
I have started with some code that parses the CSV file and creates a dictionary.
#Import modules
import csv
import pandas as pd
fields = ['Serial number','Movement type','Posting date']
df = pd.read_csv('import.csv', skipinitialspace=True, usecols=fields)
dc = df.to_dict()
#print (df['Serial number'])
for value in dc.items():
    print(value)
This code works to parse the CSV and create a dictionary.
However, I need help with the date comparison and filtering techniques. How may I create another dictionary that only lists unique serial numbers with the latest posting date? Once I have created a new filtered data dictionary I can use that to import into our asset management database. The idea is that I will use python to analyze and manipulate the data before importing into our system.
Pandas is a useful library for more than just reading csv files. In fact, you don't need the csv library at all here (it's not being used in the code sample you posted).
First you need to make sure the dates are read in as dates, by using the parse_dates parameter of the read_csv function. Then you can use pandas' grouping functionality.
# parse the 3rd column (index 2) as dates
df = pd.read_csv('import.csv', skipinitialspace=True, usecols=fields, parse_dates=[2])
last_movement = df.sort_values('Posting date').groupby('Serial number').last()
To create the string that you want, you can then iterate through the rows of last_movement:
for index, row in last_movement.iterrows():
    print('Serial Number {} has a movement type {} and was last updated {}'
          .format(index, row['Movement type'], row['Posting date']))
Which will produce the following:
Serial Number 2LMXK1 has a movement type 301 and was last updated 2015-01-05 14:00:00
Serial Number BR83GP has a movement type 301 and was last updated 2015-01-09 15:30:00
Serial Number JEMLP3 has a movement type 203 and was last updated 2015-01-07 17:30:00
Side note: pandas should be able to read the column headings for you, so you shouldn't need the usecols parameter.
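For comparison, here is a sketch of a different technique that reaches the same rows without sorting: groupby(...).idxmax() returns the row label of the latest posting date per serial number, and .loc picks those rows. This is not the method used above, only an alternative. It assumes a subset of the sample data inlined as a string and month/day/year dates.

```python
import io

import pandas as pd

# A subset of the sample data, deliberately not in date order for BR83GP.
CSV_TEXT = """Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,203,1/7/15 17:30
BR83GP,301,1/9/15 15:30
BR83GP,201,1/6/15 9:00
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)
# Assumption: dates are month/day/year.
df["Posting date"] = pd.to_datetime(df["Posting date"], format="%m/%d/%y %H:%M")

# Row label of the latest posting date within each serial number.
latest_labels = df.groupby("Serial number")["Posting date"].idxmax()
latest_rows = df.loc[latest_labels]

# A plain dictionary: serial number -> latest movement type.
latest_movement = latest_rows.set_index("Serial number")["Movement type"].to_dict()
print(latest_movement)
```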
How best to build the dictionary or sort the list depends on what you want. For the parsing side, though, you need to convert each date string into a date object so that you can make sane comparisons. For that you probably want the datetime class in the datetime module (yes, datetime.datetime).
It has a strptime() function that will do exactly that:
import datetime
datetime.datetime.strptime(r"1/5/15 13:00", "%d/%m/%y %H:%M")
# I've assumed you have a Day/Month/Year format;
# use "%m/%d/%y %H:%M" instead if the dates are Month/Day/Year
The only bit of oddness is the format specifier, which is documented here:
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
(note that where it talks about zero-padded, that's for output. It'll parse non-zero padded numbers fine)
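Putting strptime to work, a standard-library-only sketch of the filtered dictionary the question asks for might look like the following. It assumes a subset of the sample data inlined as a string and, unlike the snippet above, a month/day/year format; swap the format string if the file is day-first.

```python
import csv
import io
from datetime import datetime

# A subset of the sample data from the question.
CSV_TEXT = """Serial number,Movement type,Posting date
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,203,1/7/15 17:30
BR83GP,301,1/9/15 15:30
BR83GP,201,1/6/15 9:00
"""

DATE_FORMAT = "%m/%d/%y %H:%M"  # assumption: month/day/year

latest = {}
for row in csv.DictReader(io.StringIO(CSV_TEXT), skipinitialspace=True):
    serial = row["Serial number"]
    posted = datetime.strptime(row["Posting date"], DATE_FORMAT)
    # Keep the row only if it is newer than what is stored for this serial.
    if serial not in latest or posted > latest[serial]["Posting date"]:
        latest[serial] = {
            "Movement type": int(row["Movement type"]),
            "Posting date": posted,
        }

for serial, info in latest.items():
    print("Serial Number {} has a movement type {} and was last updated {}".format(
        serial, info["Movement type"], info["Posting date"]))
```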
