Two csv files read differently in python

I have two .csv files with data arranged date-wise. I want to compute the monthly accumulated value for each month across all the years. Both files read in without any error. However, when computing the monthly accumulated values, the code works correctly for one time series (in one csv file) but malfunctions for the other. The only difference I notice is that when I read the first csv file (with a 'Date' and 'Value' column and a total of 826 rows), the dataframe has 827 rows, with the last row all NaN. This NaN row does not appear for the other csv file.
Please note that my time series runs from 06-06-20xx to 01-10-20xx every year from 2008-2014. I am computing the monthly accumulated value for each month and then removing the zero values (for the months Jan-May and Nov-Dec). When my code runs, for the first csv I get monthly accumulated values starting from June 2008. But for the second, it starts from January 2008 (and has a non-zero value there, which ideally should be zero).
Since I am new to Python, I cannot figure out where the issue is. Any help is appreciated. Thanks in advance.
Here is my code:
import pandas as pd
import numpy as np
# read the csv file
df_obs = pd.read_csv("..path/first_csv.csv")
df_fore = pd.read_csv("..path/second_csv.csv")
# convert 'Date' column to datetime index
df_obs['Date'] = pd.to_datetime(df_obs['Date'])
df_fore['Date'] = pd.to_datetime(df_fore['Date'])
# perform GroupBy operation over monthly frequency
monthly_accumulated_obs = df_obs.set_index('Date').groupby(pd.Grouper(freq='M'))['Observed'].sum().reset_index()
monthly_accumulated_fore = df_fore.set_index('Date').groupby(pd.Grouper(freq='M'))['Observed'].sum().reset_index()
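As an aside, a quick way to see what that extra 827th row actually contains, and to drop any row with an unparseable date before grouping so that both frames are processed identically, is the following sketch (my own addition, assuming the column names used above):
# inspect rows whose Date did not parse (they show up as NaT/NaN)
print(df_obs[df_obs['Date'].isna()])
print(df_fore[df_fore['Date'].isna()])
# drop them before the groupby so a stray row cannot distort a month's total
df_obs = df_obs.dropna(subset=['Date'])
df_fore = df_fore.dropna(subset=['Date'])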

Sometimes more verbose but explicit solutions work better and are more flexible, so here's an alternative using convtools:
from datetime import date, datetime
from convtools import conversion as c
from convtools.contrib.tables import Table
# generate an ad hoc grouper;
# it's a simple function to be reused further
converter = (
    c.group_by(c.item("Date"))
    .aggregate(
        {
            "Date": c.item("Date"),
            "Observed": c.ReduceFuncs.Sum(c.item("Observed")),
        }
    )
    .gen_converter()
)
# read the stream of prepared rows
rows = (
    Table.from_csv("..path/first_csv.csv", header=True)
    .update(
        Date=c.call_func(
            datetime.strptime, c.col("Date"), "%m-%d-%Y"
        ).call_method("replace", day=1),
        Observed=c.col("Observed").as_type(float),
    )
    .into_iter_rows(dict)
)
# process them
result = converter(rows)

Related

Append multiple columns into two columns python

I have a csv file which contains approximately 100 columns of data. Each column represents temperature values taken every 15 minutes throughout the day for each of the 100 days. The header of each column is the date for that day. I want to convert this into two columns, the first being the date time (I will have to create this somehow), and the second being the temperatures stacked on top of each other for each day.
My attempt:
import numpy as np

with open("original_file.csv") as ofile:
    stack_vec = []
    next(ofile)  # skip the header row
    for line in ofile:
        columns = line.split(',')  # get all the columns
        for i in range(0, len(columns)):
            stack_vec.append(columns[i])
np.savetxt("converted.csv", stack_vec, delimiter=",", fmt='%s')
In my attempt, I am trying to create a new vector with each column appended to the end of it. However, the code is extremely slow and likely not working! Once I have this step figured out, I then need to take the date from each column and add 15 minutes to the date time for each row. Any help would be greatly appreciated.
If I got this correct, you have a csv with 96 rows and 100 columns and want to stack it into one vector, day after day, giving a vector with 9,600 entries, right?
An easy approach would be to use numpy:
import numpy as np
x = np.genfromtxt('original_file.csv', delimiter=',', skip_header=1)  # skip the row of dates
data = x.ravel(order='F')
Note: numpy is a third-party library, but it is the go-to library for this kind of math.
The first line reads the csv into an ndarray, which is like a matrix (even though it behaves differently for mathematical operations); skip_header=1 skips the row of dates.
Then ravel flattens it into a vector. order='F' makes it traverse the array column by column, so the columns are stacked one after the other, i.e. day after day. (Leave it as the default/blank if you want to go time point after time point instead.)
For your date problem, see How can I make a python numpy arange of datetime; I guess I couldn't give a better example myself.
Once you have these two arrays, you can give them the right shape with data.reshape(9600, 1) and then stack them with np.concatenate([data, dates], axis=1), with dates being your date vector reshaped the same way.
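For completeness, here is a rough sketch (my own, not part of the answer above) of how that date vector could be built and stacked next to the values. It assumes the header row holds one parseable date per column (e.g. 2016-01-04) and 96 readings per day at 15-minute steps:
import numpy as np
import pandas as pd

# values: skip the header row of dates, then flatten column by column (day after day)
x = np.genfromtxt('original_file.csv', delimiter=',', skip_header=1)
values = x.ravel(order='F')

# dates: read the header line; assumes each entry parses as a date
with open('original_file.csv') as f:
    header = f.readline().strip().split(',')

# one block of 96 timestamps (15-minute steps) per date in the header
times = np.concatenate(
    [pd.date_range(day, periods=96, freq='15min').to_numpy() for day in header]
)

stacked = np.column_stack([times, values])
np.savetxt('converted.csv', stacked, delimiter=',', fmt='%s')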

Lag values and differences in pandas dataframe with missing quarterly data

Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.
See the pictures below: the lower picture has complete data, the upper has gaps. Both pictures show correct values. In red are the columns that I want to calculate using the data in black. The column Cumm_Issd shows the accumulated issued shares during the year; MV is the market value.
I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in Market Value (D_MV_Q) and the MV of last year (L_MV_Y).
For the underlying csv data, see this link for the full data and this link for the gapped data. There are two firms: 1020180 and 1020201.
However, when I try Pandas' shift method, it fails when there are gaps; try it yourself using the csv files and the code below. The computed columns (DiffEq, Dif1MV, Lag4MV) differ, for some quarters, from IssdQtr, D_MV_Q and L_MV_Y, respectively.
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np
import os
dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
Gapped data:
Full data:
Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
Then create a data-frame with the keys (Firm and date) and the relevant variable (here MV).
lagset=dfg[['Firm','date', 'MV']].copy()
lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
Lastly, merge the new frame into the existing one:
dfg=pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')
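The same merge trick extends to the other lagged columns; for example, a sketch (reusing the column names above) of the one-quarter lag behind D_MV_Q:
# shift back a single quarter instead of four
dfg['lagQ1'] = dfg['date'] + QuarterEnd(-1)

lag1 = dfg[['Firm', 'date', 'MV']].copy()
lag1.rename(columns={'MV': 'Lag1_MV', 'date': 'lagQ1'}, inplace=True)

dfg = pd.merge(dfg, lag1, on=['Firm', 'lagQ1'], how='left')
dfg['D_MV_Q'] = dfg['MV'] - dfg['Lag1_MV']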

Python - Lookup value from one table that falls within a range in a second table

I have two tables, one contains SCHEDULE_DATE (over 300,000 records) and WORK_WEEK_CODE, and the second table contains WORK_WEEK_CODE, START_DATE, and END_DATE. The first table has duplicate schedule dates, and the second table is 3,200 unique values. I need to populate the WORK_WEEK_CODE in table one with the WORK_WEEK_CODE from table two, based off of the range where the schedule date falls. Samples of the two tables are below.
I was able to accomplish the task using arcpy.da.UpdateCursor with a nested arcpy.da.SearchCursor, but with the volume of records, it takes a long time. Any suggestions on a better (and less time consuming) method would be greatly appreciated.
Note: the date fields are stored as strings.
Table 1
SCHEDULE_DATE,WORK_WEEK_CODE
20160219
20160126
20160219
20160118
20160221
20160108
20160129
20160201
20160214
20160127
Table 2
WORK_WEEK_CODE,START_DATE,END_DATE
1601,20160104,20160110
1602,20160111,20160117
1603,20160118,20160124
1604,20160125,20160131
1605,20160201,20160207
1606,20160208,20160214
1607,20160215,20160221
You can use Pandas dataframes as a more efficient method. Here is the approach using Pandas. Hope this helps:
import pandas as pd
# First you need to convert your data to Pandas DataFrames; here I read them from csv
Table1 = pd.read_csv('Table1.csv')
Table2 = pd.read_csv('Table2.csv')
# Then you need to add a shared key for join
Table1['key'] = 1
Table2['key'] = 1
#The following line joins the two tables
mergeddf = pd.merge(Table1,Table2,how='left',on='key')
#The following line converts the string dates to actual dates
mergeddf['SCHEDULE_DATE'] = pd.to_datetime(mergeddf['SCHEDULE_DATE'],format='%Y%m%d')
mergeddf['START_DATE'] = pd.to_datetime(mergeddf['START_DATE'],format='%Y%m%d')
mergeddf['END_DATE'] = pd.to_datetime(mergeddf['END_DATE'],format='%Y%m%d')
#The following line will filter and keep only lines that you need
result = mergeddf[(mergeddf['SCHEDULE_DATE'] >= mergeddf['START_DATE']) & (mergeddf['SCHEDULE_DATE'] <= mergeddf['END_DATE'])]
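One caveat: with roughly 300,000 schedule rows and 3,200 weeks, the cross join above builds a very large intermediate frame. If memory becomes a problem, a possible alternative (a sketch of my own, not part of the answer above) is pd.merge_asof, which matches each schedule date to the latest week start on or before it and only needs the inputs sorted:
# convert the string dates up front (same formats as above)
t1 = Table1.drop('WORK_WEEK_CODE', axis=1).copy()
t1['SCHEDULE_DATE'] = pd.to_datetime(t1['SCHEDULE_DATE'], format='%Y%m%d')
t2 = Table2.copy()
t2['START_DATE'] = pd.to_datetime(t2['START_DATE'], format='%Y%m%d')
t2['END_DATE'] = pd.to_datetime(t2['END_DATE'], format='%Y%m%d')
# match each schedule date to the latest START_DATE <= it, then check it is within the week
matched = pd.merge_asof(t1.sort_values('SCHEDULE_DATE'),
                        t2.sort_values('START_DATE'),
                        left_on='SCHEDULE_DATE', right_on='START_DATE',
                        direction='backward')
result = matched[matched['SCHEDULE_DATE'] <= matched['END_DATE']]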

Convert String With Comma To Number Using Python Pandas

I am generating a pivot table report using the pandas Python module. The source data includes a lot of readings measured in milliseconds. If the number of milliseconds exceeds 999 then the value in that CSV file will include commas (e.g. 1,234 = 1.234 seconds).
Here's how I'm trying to run the report:
import pandas as pd
import numpy as np
pool_usage = pd.read_csv("c:/foo/ds-dump.csv")
# Add a column to the end that shows you where the data came from
pool_usage["Source File"] = "ds-dump.csv"
report = pool_usage.pivot_table(values=['Average Pool Size', 'Average Usage Time (ms)'], index=['Source File'], aggfunc=np.max)
print(report)
The problem is that the dtype of the Average Usage Time (ms) column is object, so the np.max function just treats it like it's NaN. I therefore never see any values greater than 999.
I tried fixing the issue like this:
import pandas as pd
import numpy as np
pool_usage = pd.read_csv("c:/foo/ds-dump.csv")
# Add a column to the end that shows you where the data came from
pool_usage["Source File"] = "ds-dump.csv"
# Convert strings to numbers if possible
pool_usage = pool_usage.convert_objects(convert_numeric=True)
report = pool_usage.pivot_table(values=['Average Pool Size', 'Average Usage Time (ms)'], index=['Source File'], aggfunc=np.max)
print(report)
This did actually change the dtype of the Average Usage Time column to a float but all of the values that are greater than 999 are still treated like NaN's.
How can I convert the Average Usage Time column to a float even though it's possible that some of the values may include commas?
The read_csv function takes an optional thousands argument. Its default is None so you can change it to "," to have it recognise 1,234 as 1234 when it reads the file:
pd.read_csv("c:/foo/ds-dump.csv", thousands=",")
The column holding the millisecond values should then have the int64 datatype once the file has been read into memory.
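As a quick illustration (using an inline sample with made-up values rather than the real ds-dump.csv), the comma-formatted readings come back as plain integers once thousands="," is set:
import io
import pandas as pd

sample = io.StringIO('Average Pool Size,Average Usage Time (ms)\n10,"1,234"\n12,987\n')
df = pd.read_csv(sample, thousands=",")
print(df.dtypes)                              # both columns come back as int64
print(df['Average Usage Time (ms)'].max())    # 1234, not NaN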

How do I find all rows with a certain date using Pandas?

I have a simple Pandas DataFrame containing columns 'valid_time' and 'value'. The frequency of the sampling is roughly hourly, but irregular and with some large gaps. I want to be able to efficiently pull out all rows for a given day (i.e. within a calender day). How can I do this using DataFrame.where() or something else?
I naively want to do something like this (which obviously doesn't work):
dt = datetime.datetime(<someday>)
rows = data.where(data['valid_time'].year == dt.year and
                  data['valid_time'].day == dt.day and
                  data['valid_time'].month == dt.month)
There are at least a few problems with the above code. I am new to pandas, so I am fumbling with something that is probably straightforward.
Pandas is absolutely terrific for things like this. I would recommend making your datetime field your index as can be seen here. If you give a little bit more information about the structure of your dataframe, I would be happy to include more detailed directions.
Then, you can easily grab all rows from a date using df['1-12-2014'], which would grab everything from Jan 12, 2014. You can edit that to get everything from January by using df['1-2014']. If you want to grab data from a range of dates and/or times, you can do something like:
df['1-2014':'2-2014']
Pandas is pretty powerful, especially for time-indexed data.
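A minimal sketch of that approach against the question's data frame (assuming its 'valid_time' column; on current pandas versions, row selection by date string is spelled with .loc):
data = data.set_index('valid_time').sort_index()

one_day = data.loc['2014-01-12']            # every row from Jan 12, 2014
one_month = data.loc['2014-01']             # everything from January 2014
jan_feb = data.loc['2014-01':'2014-02']     # a range: January through February 2014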
Try this (it is just a continuation of your idea):
import pandas as pd
import numpy.random as rd
import datetime
times = pd.date_range('2014/01/01','2014/01/6',freq='H')
values = rd.random_integers(0,10,times.size)
data = pd.DataFrame({'valid_time':times, 'values': values})
dt = datetime.datetime(2014,1,3)
rows = data['valid_time'].apply(
    lambda x: x.year == dt.year and x.month == dt.month and x.day == dt.day
)
print(data[rows])
