nested for loops, using values to create columns - python

I'm pretty new to Python programming. I read a CSV file into a DataFrame with the median house price of each month as columns. Now I want to create columns holding the mean value of each quarter, e.g. a column Housing['2000q1'] as the mean of 2000-01, 2000-02, and 2000-03, a column Housing['2000q2'] as the mean of 2000-04, 2000-05, and 2000-06, and so on.
The raw DataFrame is named 'Housing'.
I tried nested for loops as below, but I always get errors.
for i in range(2000, 2017):
    for j in range(1, 5):
        Housing[i 'q' j] = Housing[[i '-' j*3-2, i '-' j*3-1, i '_' j*3]].mean(axis=1)
Thank you!
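For what it's worth, the loop as written fails because i 'q' j is not valid Python: the year and quarter have to be formatted into strings, and the month numbers zero-padded. A minimal corrected sketch, assuming monthly columns named like '2000-01' (toy data, not the real Housing frame):

```python
import numpy as np
import pandas as pd

# Toy frame: one row, monthly columns named 'YYYY-MM' as in the question
cols = [f"{y}-{m:02d}" for y in range(2000, 2002) for m in range(1, 13)]
Housing = pd.DataFrame(np.arange(24.0).reshape(1, 24), columns=cols)

for i in range(2000, 2002):
    for j in range(1, 5):
        # The three months of quarter j, zero-padded to match the column names
        months = [f"{i}-{i_m:02d}" for i_m in (j*3 - 2, j*3 - 1, j*3)]
        Housing[f"{i}q{j}"] = Housing[months].mean(axis=1)

print(Housing["2000q1"])  # mean of 2000-01, 2000-02, 2000-03
```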

Usually we work with data where the rows are time, so it's good practice to do the same and transpose your data, starting with df = Housing.set_index('CountyName').T (also, variable names should usually start with a lowercase letter, but this isn't important here).
Since your data is already in such a nice format, there is a pragmatic solution (pragmatic in the sense that you need not know much about datetime objects and methods):
df.reset_index(inplace=True)  # This moves the dates to a column named 'index'
df.rename(columns={'index': 'quarter'}, inplace=True)  # Rename this column into something more meaningful
# Rename the months into the appropriate quarters
# (str.replace has no inplace argument, so assign the result back)
df['quarter'] = df['quarter'].str.replace('-01|-02|-03', 'q1', regex=True)
df['quarter'] = df['quarter'].str.replace('-04|-05|-06', 'q2', regex=True)
df['quarter'] = df['quarter'].str.replace('-07|-08|-09', 'q3', regex=True)
df['quarter'] = df['quarter'].str.replace('-10|-11|-12', 'q4', regex=True)
df = df[df['quarter'] != 'SizeRank']  # Drop the SizeRank row to avoid including it in the calculation of means
c = df.drop(columns='quarter').notnull().sum(axis=1)  # Count the number of non-empty entries in each row
df['total'] = df.drop(columns='quarter').sum(axis=1)  # The totals in each month
df['c'] = c  # Only assign c after computing the total, so it doesn't interfere with the total column
g = df.groupby('quarter')[['total','c']].sum()
g['q_mean'] = g['total']/g['c']
g
g['q_mean'] or g[['q_mean']] should give you the required answer.
Note that we needed to compute the mean manually because you had missing data; otherwise, df.groupby('quarter').mean().mean() would have immediately given you the answer you needed.
A remark: the technically 'correct' way would be to convert your dates into a datetime-like object (which you can do with pd.to_datetime()), then run a groupby with a pd.Grouper(freq=...) argument (pd.TimeGrouper in older pandas, since deprecated and removed); this would certainly be worth learning more about if you are going to work with time-indexed data a lot.
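A minimal sketch of that datetime-based route on toy data, using pd.Grouper (the modern replacement for the deprecated pd.TimeGrouper); the column name and values here are illustrative:

```python
import pandas as pd

# Toy data: one value column indexed by month strings, as in the transposed frame
df = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
    index=["2000-01", "2000-02", "2000-03", "2000-04", "2000-05", "2000-06"],
)
df.index = pd.to_datetime(df.index)  # month strings -> Timestamps

# Group rows into calendar quarters ('QS' = quarter-start frequency) and average
quarterly = df.groupby(pd.Grouper(freq="QS")).mean()
print(quarterly)  # one row per quarter: 2.0 for Q1, 5.0 for Q2
```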

You can achieve this very simply using pandas' resample function to compute quarterly averages.
pandas resampling: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
A summary of the offset aliases (e.g. 'Q' for quarter-end) can be found in the pandas resample documentation.
In order to use this function you need to have only time as columns, so you should temporarily move CountyName and SizeRank into the index.
Code:
QuarterlyAverage = Housing.set_index(['CountyName', 'SizeRank'], append=True)\
                          .resample('Q', axis=1).mean()\
                          .reset_index(['CountyName', 'SizeRank'], drop=False)
Thanks to @jezrael for suggesting axis=1 in the resampling.
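A runnable miniature of the same idea on toy data (column names assumed): newer pandas deprecates axis=1 in resample, so this sketch transposes first and resamples along the index instead:

```python
import numpy as np
import pandas as pd

# Toy frame: two counties, six monthly price columns
months = pd.date_range("2000-01-01", periods=6, freq="MS")
Housing = pd.DataFrame(np.arange(12.0).reshape(2, 6), columns=months)
Housing.insert(0, "CountyName", ["A", "B"])

# Put time on the index (transposing avoids the deprecated axis=1 resample),
# make sure it really is a DatetimeIndex, then resample quarterly
t = Housing.set_index("CountyName").T
t.index = pd.to_datetime(t.index)
QuarterlyAverage = t.resample("QS").mean().T
print(QuarterlyAverage)  # one column per quarter, one row per county
```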

Related

How to loop through a pandas data frame using a columns values as the order of the loop?

I have two CSV files which I'm using in a loop. One of the files has a column called "Availability Score"; is there a way to make the loop iterate through the records in descending order of this column? I thought I could use Ob.sort_values(by=['AvailabilityScore'], ascending=False) to change the order of the dataframe first, so that when the loop starts it will already be in the right order. I've tried this out and it doesn't seem to make a difference.
import pandas as pd

# Import the data
CF = pd.read_csv(r'CustomerFloat.csv')
Ob = pd.read_csv(r'Orderbook.csv')
# Convert to DataFrames (note: read_csv already returns DataFrames, so these two lines are redundant)
CF = pd.DataFrame(CF)
Ob = pd.DataFrame(Ob)
# Remove subassemblies
Ob.drop(Ob[Ob['SubAssembly'] != 0].index, inplace=True)
# Sort the data by their IDs
Ob.sort_values(by=['CustomerFloatID'])
CF.sort_values(by=['FloatID'])
# Sort the orderbook by its availability score
Ob.sort_values(by=['AvailabilityScore'], ascending=False)
# Loop for urgent values
for i, rowi in CF.iterrows():
    count = 0
    urgent_value = 1
    for j, rowj in Ob.iterrows():
        if rowi['FloatID'] == rowj['CustomerFloatID'] and count < rowi['Urgent Deficit']:
            Ob.at[j, 'CustomerFloatPriority'] = urgent_value
            count += rowj['Qty']
You need to add inplace=True, like this:
Ob.sort_values(by=['AvailabilityScore'],ascending=False, inplace=True)
sort_values() (like most pandas functions nowadays) is not in-place by default. You should assign the result back to the variable that holds the DataFrame:
Ob = Ob.sort_values(by=['CustomerFloatID'], ascending=False)
# ...
BTW, while you can pass inplace=True as argument to sort_values(), I do not recommend it. Generally speaking, inplace=True is often considered bad practice.
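A small illustration of the difference on a toy frame (column names taken from the question, data made up):

```python
import pandas as pd

Ob = pd.DataFrame({"AvailabilityScore": [2, 9, 5], "Qty": [1, 1, 1]})

# Without assignment, the sorted result is discarded
Ob.sort_values(by=["AvailabilityScore"], ascending=False)
print(Ob["AvailabilityScore"].tolist())  # still [2, 9, 5]

# Assigning the result back actually changes Ob
Ob = Ob.sort_values(by=["AvailabilityScore"], ascending=False)
print(Ob["AvailabilityScore"].tolist())  # now [9, 5, 2]
```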

Using pandas .shift on multiple columns with different shift lengths

I have created a function that parses through each column of a dataframe, shifts the data in that column up to the first observation (shifting past '-'), and stores each column in a dictionary. I then convert the dictionary back to a dataframe to get the appropriately shifted columns. The function works and takes about 10 seconds on a 12x3000 dataframe; however, on a 12x25000 dataframe it is extremely slow. I feel there must be a better way to approach this to increase the speed, perhaps even an argument of the shift function that I am missing. I'd appreciate any help.
def create_seasoned_df(df_orig):
    """
    Creates a seasoned dataframe with only the first 12 periods of a loan
    """
    df_seasoned = df_orig.reset_index().copy()
    temp_dic = {}
    for col in cols:  # 'cols' is assumed to be defined elsewhere as the list of columns to shift
        to_shift = -len(df_seasoned[df_seasoned[col] == '-'])
        temp_dic[col] = df_seasoned[col].shift(periods=to_shift)
    df_seasoned = pd.DataFrame.from_dict(temp_dic, orient='index').T[:12]
    return df_seasoned
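For illustration, here is a self-contained miniature of the function above on a toy frame (with cols, which the question leaves undefined, taken to be the frame's columns, and reset_index(drop=True) so the old index doesn't become an extra column):

```python
import pandas as pd

df_orig = pd.DataFrame({
    "a": ["-", "-", 1, 2, 3],   # two leading '-' placeholders
    "b": ["-", 4, 5, 6, 7],     # one leading '-' placeholder
})
cols = df_orig.columns

def create_seasoned_df(df_orig):
    """Shift each column up past its '-' placeholders; keep the first 12 rows."""
    df_seasoned = df_orig.reset_index(drop=True).copy()
    temp_dic = {}
    for col in cols:
        to_shift = -len(df_seasoned[df_seasoned[col] == '-'])
        temp_dic[col] = df_seasoned[col].shift(periods=to_shift)
    return pd.DataFrame.from_dict(temp_dic, orient='index').T[:12]

print(create_seasoned_df(df_orig))  # data moved to the top, NaN padding below
```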
Try using apply over the columns instead of a Python loop (note: the shift must be negative to move the data up, and the original lambda referenced df_seasoned[col], which is undefined at that point; counting the '-' entries of the column being applied fixes both):
def create_seasoned_df(df_orig):
    df_seasoned = df_orig.reset_index().copy()
    return df_seasoned.apply(lambda x: x.shift(-x.eq('-').sum()), axis=0)[:12]

Python list comparison numpy optimization

I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns is a list of lists with a sequence of 7 integers. Example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['A','B','C','D','E','F','G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name','Location','Sequence'],
                   data=[['Alfred', 'Chicago', np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York', np.random.randint(1, 5, (100, 7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c/len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default pandas uses a Cython engine, but you can change the engine to numba, or use the njit decorator directly, to speed things up; look up "enhancing performance" in the pandas docs. Numba compiles Python code to optimized machine code, and since pandas is tightly integrated with numpy, it works well with numba too. You can experiment with the parallel, nogil, cache, and fastmath options for further speedups. This approach shines for large inputs where speed is needed.
With numba you can compile eagerly, or let the first execution pay a small compilation cost, after which subsequent calls are fast.
import numba as nb
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['A','B','C','D','E','F','G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name','Location','Sequence'],
                   data=[['Alfred', 'Chicago', np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York', np.random.randint(1, 5, (100, 7))]])
a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())
# This is just an illustration, not the correct logic. Change the logic according to your needs
# @nb.njit((nb.int64[:, :],))
# def f(x):
#     total = 0
#     for i in nb.prange(x.shape[0]):
#         for j in range(a.shape[0]):
#             total += (x[i] == a[j]).sum()
#     return total
# Experiment with the engine
print(df2['Sequence'].apply(f))
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)

Create a Pandas daily aggregate time series from a DataFrame with date ranges

I have a Pandas DataFrame of subscriptions, each with a start datetime (timestamp) and an optional end datetime (if they were canceled).
For simplicity, I have created string columns for the date (e.g. "20170901") based on start and end datetimes (timestamps). It looks like this:
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None), ...], columns=["sd", "ed"])
The end result should be a time series of how many subscriptions were active on any given date in a range.
To that end, I created an Index for all the days within a range:
days = df.groupby(["sd"])["sd"].count()
I am able to create what I am interested in with a loop each executing a query over the entire DataFrame df.
count_by_day = pd.DataFrame([
    len(df.loc[(df.sd <= i) & (df.ed.isnull() | (df.ed > i))])
    for i in days.index], index=days.index)
Note that I have values for each day in the original dataset, so there are no gaps. I'm sure getting the date range can be improved.
The actual question is: is there an efficient way to compute this for a large initial dataset df, with multiple thousands of rows? It seems the method I used is quadratic in complexity. I've also tried df.query() but it's 66% slower than the Pythonic filter and does not change the complexity.
I tried to search the Pandas docs for examples but I seem to be using the wrong keywords. Any ideas?
It's an interesting problem; here's how I would do it, though I'm not sure about performance.
EDIT: My first answer was incorrect, I didn't read fully the question
# Initial data, columns as Timestamps
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None)], columns=["sd", "ed"])
df['sd'] = pd.DatetimeIndex(df.sd)
df['ed'] = pd.DatetimeIndex(df.ed)
# Range input and related index
beg = pd.Timestamp('2017-05-15')
end = pd.Timestamp('2017-09-15')
idx = pd.date_range(start=beg, end=end, freq='D')
# We keep only records that overlap the range and then clip the
# subscription start/end to the range bounds.
fdf = df[(df.sd <= end) & ((df.ed >= beg) | (pd.isnull(df.ed)))].copy()
fdf['ed'] = fdf['ed'].fillna(end)
fdf['ps'] = fdf.sd.apply(lambda x: max(x, beg))
fdf['pe'] = fdf.ed.apply(lambda x: min(x, end))
# We run a conditional count
idx.to_series().apply(lambda x: len(fdf[(fdf.ps <= x) & (fdf.pe >= x)]))
Ok, I'm answering my own question after quite a bit of research, fiddling and trying things out. I may still be missing an obvious solution but maybe it helps.
The fastest solution I could find to date is (thanks Alex for some nice code patterns):
# Start with test data from question
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'),
                   ('20170901', None), ...], columns=['sd', 'ed'])
# Convert to datetime columns
df['sd'] = pd.DatetimeIndex(df['sd'])
df['ed'] = pd.DatetimeIndex(df['ed'])
df.ed.fillna(df.sd.max(), inplace=True)
# Note: In my real data I have timestamps - I convert them like this:
#df['sd'] = pd.to_datetime(df['start_date'], unit='s').apply(lambda x: x.date())
# Set and sort multi-index to enable slices
df = df.set_index(['sd', 'ed'], drop=False)
df.sort_index(inplace=True)
# Compute the active counts by day in range
di = pd.date_range(start=df.sd.min(), end=df.sd.max(), freq='D')
count_by_day = di.to_series().apply(lambda i: len(df.loc[
    (slice(None, i.date()), slice(i.date(), None)), :]))
In my real dataset (with >10K rows for df and a date range of about a year), this was twice as fast as the code in the question, about 1.5s.
Here some lessons I learned:
Creating a Series with counters for the date range and iterating through the dataset df with df.apply or df.itertuples, incrementing counters, was much slower. Curiously, apply was slower than itertuples; don't even think of iterrows.
My dataset had a product_id with each row, so filtering the dataset for each product and running the calculation on the filtered result (for each product) was twice as fast as adding the product_id to the multi-index and slicing on that level too
Building an intermediate Series of active days (from iterating through each row in df and adding each date in the active range to the Series) and then grouping by date was much slower.
Running the code in the question on a df with a multi-index did not change the performance.
Running the code in the question on a df with a limited set of columns (my real dataset has 22 columns) did not change the performance.
I looked at pd.crosstab and pd.Period but was not able to get anything to work.
Pandas is pretty awesome and trying to outsmart it is really hard (especially with non-vectorized Python).
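For completeness, a common alternative for this kind of interval counting (not used in the answers above) is to turn each subscription into +1/-1 events and take a cumulative sum, which avoids the per-day scans entirely. A sketch on the question's toy data:

```python
import pandas as pd

df = pd.DataFrame([('20170511', None), ('20170514', '20170613'),
                   ('20170901', None)], columns=['sd', 'ed'])
df['sd'] = pd.to_datetime(df['sd'])
df['ed'] = pd.to_datetime(df['ed'])

# +1 on each start date, -1 on the day after each end date (end date inclusive)
starts = df['sd'].value_counts()
ends = (df['ed'].dropna() + pd.Timedelta(days=1)).value_counts()
events = starts.sub(ends, fill_value=0).sort_index()

# A cumulative sum over a full daily index gives active subscriptions per day
days = pd.date_range(df['sd'].min(), df['sd'].max(), freq='D')
active = events.reindex(days, fill_value=0).cumsum()
print(active.loc['2017-06-01'])  # both early subscriptions still active
```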

Filling missing data with different methods

I have a set of data with a timestamp, a value, and a quality flag. The value and quality flag are missing for some of the timestamps and need to be filled with a dependence on the surrounding data, i.e.:
If the quality flags on the valid data bracketing the NaN data are different, then set the value and quality flag to the same as the bracketing row with the highest quality flag. In the example below, the first set of NaNs would be replaced with qf=3 and value=3.
If the quality flags are the same, then interpolate the value between the two valid values on either side. In the example, the second set of NaNs would be replaced by qf = 1 and v = 6 and 9.
Code:
from datetime import datetime
import pandas as pd

start = datetime.strptime("2004-01-01 00:00", "%Y-%m-%d %H:%M")
end = datetime.strptime("2004-01-01 03:00", "%Y-%m-%d %H:%M")
df = pd.DataFrame(
    data={'v':  [1,1,'NaN','NaN','NaN',3,2,1,5,3,'NaN','NaN',12,43,23,12,32,12,12],
          'qf': [1,1,'NaN','NaN','NaN',3,1,5,1,1,'NaN','NaN',1,3,4,2,1,1,1]},
    index=pd.date_range(start, end, freq="10min"))
I have tried to solve this by finding the NA rows and looping through them to fix the first criterion, then using interpolate to solve the second. However, this is really slow, as I am working with a large dataset.
One approach would just be to do all the possible fills and then choose among them as appropriate. After doing df = df.astype(float) if necessary (your example uses the string "NaN"), something like this should work:
is_null = df.qf.isnull()
fill_down = df.ffill()
fill_up = df.bfill()
df.loc[is_null & (fill_down.qf > fill_up.qf)] = fill_down
df.loc[is_null & (fill_down.qf < fill_up.qf)] = fill_up
df = df.interpolate()
It does more work than is necessary, but it's easy to see what it's doing, and the work that it does do is vectorized and so happens pretty quickly. On a version of your dataset expanded to be ~10M rows (with the same density of nulls), it takes ~6s on my old notebook. Depending on your requirements that might suffice.
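To make the recipe concrete, here it is run on a shortened version of the question's data (with the 'NaN' strings replaced by real NaNs, as noted above); the first gap is bracketed by qf=1 and qf=3, the second by qf=1 on both sides:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'v':  [1, 2, np.nan, np.nan, np.nan, 3, 2, 1, 5, 3, np.nan, np.nan, 12],
    'qf': [1, 1, np.nan, np.nan, np.nan, 3, 1, 5, 1, 1, np.nan, np.nan, 1],
}, dtype=float)

is_null = df.qf.isnull()
fill_down = df.ffill()   # propagate the last valid row forward
fill_up = df.bfill()     # propagate the next valid row backward

# Where the bracketing quality flags differ, take the row with the higher flag
df.loc[is_null & (fill_down.qf > fill_up.qf)] = fill_down
df.loc[is_null & (fill_down.qf < fill_up.qf)] = fill_up
# Where they are equal, the remaining NaNs are interpolated linearly
df = df.interpolate()

print(df.loc[2:4, ['qf', 'v']])    # first gap: filled from the qf=3 side
print(df.loc[10:11, ['qf', 'v']])  # second gap: interpolated, v = 6 and 9
```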
