Iterate over dates pandas - python

I have a pandas dataframe:
CITY DT Error%
1 A 1/1/2020 0.03436722
2 A 1/2/2020 0.03190177
3 B 1/9/2020 0.040218757
4 B 1/8/2020 0.098921665
I want to iterate through the dataframe and check whether a DT and the DT one week later both have an Error% of less than 0.05.
I want the result to be the following dataframe rows:
2 A 1/2/2020 0.03190177
3 B 1/9/2020 0.040218757

IIUC,
df['DT'] = pd.to_datetime(df['DT'])
idx = df[df['DT'].sub(df['DT'].shift()).gt('6 days')].index.tolist()
indices = []
for i in idx:
    indices.append(i - 1)
    indices.append(i)
print(df.loc[df['Error%'] <= 0.05].loc[indices])
CITY DT Error%
2 A 2020-01-02 0.031902
3 B 2020-01-09 0.040219

Not particularly elegant, but it gets the job done and maybe some of the professionals here can improve on it:
First, merge the information for the day with the info for the day a week after by performing a self-join on the time-shifted DT column. We can use an inner join since we're only interested in rows that have an entry for the week after:
tmp = df.set_index(df.DT.apply(lambda x: x + pd.Timedelta('7 days'))) \
        .join(df.set_index('DT'), lsuffix='_L', how='inner')
Then select the date column for those entries where both error margins are satisfied:
tmp = tmp.DT.loc[(tmp['Error%_L'] < 0.05) & (tmp['Error%'] < 0.05)]
tmp is now a pd.Series with information in the index (the shifted dates, i.e. the later week) and in the values (the earlier week's dates). Since both dates are wanted in the output, compile the "index dates" by taking the unique values among all of them:
idx = list(set(tmp.tolist() + tmp.index.tolist()))
And finally, grab the corresponding rows from the original dataframe:
df.set_index('DT').loc[idx].reset_index()
This, however, loses the original row numbers. If those are needed, you'll have to save the index to a column first and restore it after selecting the relevant rows.
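A minimal sketch of that last step (assuming the original row numbers live in the default RangeIndex, and idx is the list built above):
# Sketch: keep the original row numbers around while indexing by DT
df_keep = df.reset_index()                  # original row number -> column 'index'
result = (df_keep.set_index('DT')
                 .loc[idx]
                 .reset_index()
                 .set_index('index'))       # restore the original row numbers
print(result)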

Related

How to compare 2 data frames in python and highlight the differences?

I am trying to compare 2 files; one is in xls and the other is in csv format.
File1.xlsx (not actual data)
Title Flag Price total ...more columns
0 A Y 12 300
1 B N 15 700
2 C N 18 1000
..
..
more rows
File2.csv (not actual data)
Title Flag Price total ...more columns
0 E Y 7 234
1 B N 16 600
2 A Y 12 300
3 C N 17 1000
..
..
more rows
I used pandas and loaded those files into data frames. There are no unique columns (to use as an id) in the files, and there are 700K records to compare. I need to compare File 1 with File 2 and show the differences. I have tried a few things but I am not getting the differences I expect.
If I use the merge function as below, I get output with values only from File 1.
diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')
output I am getting
Title Flag Price total
1 B N 15 700
2 C N 18 1000
This output is not showing the correct diff, as the record with Title 'E' is missing.
I also tried using pandas merge:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
& output for above was
Title Flag Price total Exist
0 A Y 12 300 both
1 B N 15 700 left_only
2 C N 18 1000 left_only
3 E Y 7 234 right_only
4 B N 16 600 right_only
5 C N 17 1000 right_only
The problem with the above output is that it shows records from both data frames, and it will be very difficult to read when there are thousands of records in each data frame.
The output I am looking for (for the differences) adds an extra column ("Comments") with a message such as matching, exact difference, new record, etc., or something along those lines:
Title Flag Price total Comments
0 A Y 12 300 matching
1 B N 15 700 Price, total different
2 C N 18 1000 Price different
3 E Y 7 234 New record
If the above output is not possible, please suggest any other way to solve this.
PS: This is my first question here, so please let me know if you need more details.
Rows in DF1 Which Are Not Available in DF2
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
Rows in DF2 Which Are Not Available in DF1
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']
If you're differentiating by row not column
pd.concat([df1,df2]).drop_duplicates(keep=False)
If each df has the same columns and each column should be compared individually
for col in df1.columns:
    set(df1[col]).symmetric_difference(df2[col])
# WARNING: this way of getting column diffs likely won't keep row order
# new row order will be [unique_elements_from_df1_REVERSED] concat [unique_elements_from_df2_REVERSED]
Let's assume df1 (left) is our "source of truth" for what's considered an original record.
After running the outer merge with the indicator column kept (the second approach above),
diff_df = df1.merge(df2, how='outer', indicator='Exist')
take the output and split it into 2 DataFrames:
df1 = diff_df[diff_df["Exist"].isin(["both", "left_only"])]
df2 = diff_df[diff_df["Exist"] == "right_only"]
Right now, if you drop the "Exist" column from df1, you'll have the records whose comment would be "matching".
Let's assume you add the 'comments' column to df1
You could say that everything in df2 is a new record, but that would disregard the "price/total different" case.
If you really want the difference comments, this is the tricky bit: the 'how' really depends on which columns matter most (title > flag > ...) and how much they matter (a weighting system).
Once you have a weighting system determined, you need a 'scoring' method that compares two rows to see how similar they are, based on the column ranking you choose.
# distributes weight so first is heaviest, last is lightest, total weight roughly 100
# if i was good i'd do this with numpy not manually
def getWeights(l):
    weights = [0 for col in l]
    total = 100
    while total > 0:
        for i, e in enumerate(l):
            for j in range(i + 1):
                weights[j] += 1
                total -= 1
    return weights

def scoreRows(row1, row2):
    s = 0
    for i, colName in enumerate(colRank):
        if row1[colName] == row2[colName]:
            s += weights[i]
    return s

colRank = ['title', 'flag']
weights = getWeights(colRank)
Let's say only these 2 matter and the rest are considered 'modifications' to an original row.
That is to say, if a row in df2 doesn't have a matching title OR flag for ANY row in df1, that row is a new record.
What makes a row a new record is completely up to you.
Another way of thinking about it: you need to determine what makes some row in df2 'differ' from some particular row in df1 rather than from a different row in df1.
If you have 2 rows in df1
row1: [1, 2, 3, 4]
row2: [1, 6, 3, 7]
and you want to compare this row against that df:
[1, 6, 5, 4]
This row has the same first element as both, the same second element as row2, and the same 4th element as row1.
So which row does it differ from?
If this is a question you aren't sure how to answer, consider cutting your losses and just keeping df1 as "good" records and df2 as "new" records.
If you're sticking with the 'differs' comment, the next step is to separate truly new records from records that have slight differences by building a score table.
# to recap
# df1 has "both" and "left_only" records ("matching" comment)
# df2 has "right_only" records (new records and differing records)
rowScores = []
# list of lists
# each inner list index correlates to the index for df2
# inner lists are made up of tuples
# each tuple's first element is the actual row from df1 that was matched
# the second element is the score for the match (out of roughly 100)
for i, row1 in df2.iterrows():
    thisRowsScores = []
    # df2 first because its rows are what we are scoring
    for j, row2 in df1.iterrows():
        s = scoreRows(row1, row2)
        if s > 0:  # only save rows and scores that matter
            thisRowsScores.append((row2, s))
    # at this point, you can either leave the scoring as a table and have comments
    # refer to how the differences relate back to some row,
    # or just keep the best score like i'll be doing
    # sort by score, best match first
    thisRowsScores.sort(key=lambda x: x[1], reverse=True)
    # append an empty list if no good match was found in df1
    # alternatively, remove reverse=True above and index at -1
    rowScores.append(thisRowsScores[0] if thisRowsScores else [])
The reason we save the row itself is so that it can be looked up in df1 in order to add a "differs" comment.
At this point, let's just say that df1 already has the "matching" comments added to it.
Now that each row in df2 has a score and a reference to the row it matched best in df1, we can edit the comment of that row in df1 to list the columns with different values.
But at this point, that df probably also needs a reference back to df2 so that the record and values those differences refer to are actually retrievable.
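If Title and Flag can serve as a rough key, a much simpler way to get the Comments column is a keyed merge with suffixes. This is only a sketch under that assumption (the question says there is no truly unique id, so duplicate Title/Flag pairs would multiply rows), with column names taken from the sample data:
import pandas as pd

# Sketch only: assumes Title + Flag roughly identify a record
merged = df1.merge(df2, on=['Title', 'Flag'], how='outer',
                   suffixes=('_1', '_2'), indicator=True)

def make_comment(row):
    if row['_merge'] == 'right_only':
        return 'New record'
    if row['_merge'] == 'left_only':
        return 'Missing from File2'
    # both sides present: list the value columns that differ
    diffs = [c for c in ['Price', 'total'] if row[c + '_1'] != row[c + '_2']]
    return ', '.join(diffs) + ' different' if diffs else 'matching'

merged['Comments'] = merged.apply(make_comment, axis=1)
print(merged[['Title', 'Flag', 'Price_1', 'total_1', 'Comments']])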

Groupby count per category per month (Current month vs Remaining past months) in separate columns in pandas

Let's say I have the following dataframe:
I am trying to get something like this.
I was thinking of maybe using the rolling function, having separate dataframes for each count type (current month and past 3 months), and then merging them based on ID.
I am new to Python and pandas, so please bear with me if it's a simple question. I am still learning :)
EDIT:
#furas so I started with calculating the cumulative sum for all the counts as separate columns:
df['f_count_cum'] = df.groupby(["ID"])['f_count'].transform(lambda x: x.expanding().sum())
df['t_count_cum'] = df.groupby(["ID"])['t_count'].transform(lambda x: x.expanding().sum())
and then just get the current month df by
df_current = df[df.index == max(df.index)]
df_past_month = df[df.index == (max(df.index) - 1)]
and then just merge the two dataframes based on the ID?
I am not sure if it's correct, but this is my first take on this.
A few assumptions looking at the input sample:
The Month index is of datetime64[ns] type. If not, please use the line below to typecast it:
df['Month'] = pd.to_datetime(df.Month)
The Month column is the index. If not, please set it as the index:
df = df.set_index('Month')
The last month of the df is considered the current month and the first 3 months the 'past 3 months'. If not, modify the last and first function calls accordingly in df1 and df2 respectively.
Code
df1 = df.last('M').groupby('ID').sum().reset_index().rename(
          columns={'f_count': 'f_count(current month)',
                   't_count': 't_count(current month)'})
df2 = df.first('3M').groupby('ID').sum().reset_index().rename(
          columns={'f_count': 'f_count(past 3 months)',
                   't_count': 't_count(past 3 months)'})
df = pd.merge(df1, df2, on='ID', how='inner').reindex(columns=['ID',
          'f_count(current month)', 'f_count(past 3 months)',
          't_count(current month)', 't_count(past 3 months)'])
Output
ID f_count(current month) f_count(past 3 months) t_count(current month) t_count(past 3 months)
0 A 3 13 8 14
1 B 3 5 7 5
2 C 1 3 2 4
Another version of the same code, if you prefer a function and a single statement:
def get_df(freq):
    if freq == 'M':
        return df.last('M').groupby('ID').sum().reset_index()
    return df.first('3M').groupby('ID').sum().reset_index()

df = pd.merge(get_df('M').rename(
                  columns={'f_count': 'f_count(current month)',
                           't_count': 't_count(current month)'}),
              get_df('3M').rename(
                  columns={'f_count': 'f_count(past 3 months)',
                           't_count': 't_count(past 3 months)'}),
              on='ID').reindex(columns=['ID',
                  'f_count(current month)', 'f_count(past 3 months)',
                  't_count(current month)', 't_count(past 3 months)'])
EDIT:
For the previous two months before the current month (we can use different combinations of the first and last functions as needed):
df2 = df.last('3M').first('2M').groupby('ID').sum().reset_index().rename(
          columns={'f_count': 'f_count(past 3 months)',
                   't_count': 't_count(past 3 months)'})
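Note that DataFrame.first and DataFrame.last used above have since been deprecated in newer pandas releases. A rough equivalent using plain boolean masks on the DatetimeIndex (a sketch under the same assumptions as above):
import pandas as pd

# Sketch: select the last calendar month and the first three months' worth of data
# without DataFrame.first/last, assuming a DatetimeIndex as set up above
cur_month = df.index.max().to_period('M')
current = df[df.index.to_period('M') == cur_month]
past3 = df[df.index < df.index.min() + pd.DateOffset(months=3)]

df1 = current.groupby('ID').sum().reset_index()
df2 = past3.groupby('ID').sum().reset_index()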

Accessing Pandas groupby() function

I have the below data frame with me after doing the following:
train_X = icon[['property', 'room', 'date', 'month', 'amount']]
train_frame = train_X.groupby(['property', 'month', 'date', 'room']).median()
print(train_frame)
amount
property month date room
1 6 6 2 3195.000
12 3 2977.000
18 2 3195.000
24 3 3581.000
36 2 3146.000
3 3321.500
42 2 3096.000
3 3580.000
54 2 3195.000
3 3580.000
60 2 3000.000
66 3 3810.000
78 2 3000.000
84 2 3461.320
3 2872.800
90 2 3461.320
3 3580.000
96 2 3534.000
3 2872.800
102 3 3581.000
108 3 3580.000
114 2 3195.000
My objective is to track the median amount based on the (property, month, date, room)
I did this:
big_list = [[property, month, date, room], ...]
test_list = [property, month, date, room]
if test_list in big_list:
    # I want to get the median amount for the row that matches test_list
How do I do this?
What I did is try the below...
count = 0
test_list = [2, 6, 36, 2]
for j in big_list:
    if test_list == j:
        break
    count += 1
Now, after getting the count, how do I access the median amount by count in the dataframe? Is there a way to access the dataframe by index?
Please note:
big_list is the list of lists where each list is [property, month, date, room] from the above dataframe.
test_list is an incoming list to be matched against big_list.
Answering the last question:
Is there a way to access the dataframe by index?
Of course there is: you should use df.iloc or df.loc.
It depends whether you want to access purely by integer position (I guess this is your situation), in which case you should use iloc, or by label (for example a string index), in which case you can use loc.
Documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
Edit:
Coming back to the question.
I assume, then, that 'amount' is the median you are searching for.
You can use the reset_index() method on the grouped dataframe, like
train_frame_reset = train_frame.reset_index()
and then you can again access your column names, so you should be able to do the following (assuming j is the index of the found row):
train_frame_reset.iloc[j]['amount']  # will give you the median
If I understand your problem correctly, you don't need to count at all; you can access the values via loc directly.
Look at:
A = pd.DataFrame([[5,6,9],[5,7,10],[6,3,11],[6,5,12]], columns=['lev0','lev1','val'])
Then you did:
test=A.groupby(['lev0','lev1']).median()
Accessing, say, the median of the group lev0=6 and lev1=5 can be done via:
test.loc[6,5]
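Applied to the grouped frame from the question, the same idea removes the need to count at all. A small sketch (assuming the four values in test_list actually exist in the index; the MultiIndex levels are ordered (property, month, date, room) as in the groupby above):
# Sketch: direct lookup on the grouped MultiIndex, no counting needed
test_list = [2, 6, 36, 2]
median_amount = train_frame.loc[tuple(test_list), 'amount']
print(median_amount)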

Find difference between many datetime columns in pandas

I have a Dataframe with 282 columns, and 14000 rows. It looks as follows:
0 1 ... 282
uref_fixed
0006d730f5aa8492a59150e35bca5cc6 3/26/2018 7/3/2018 ...
00076311c47c44c33ffb834b1cebf5db 5/13/2018 5/13/2018 ...
0009ba8a69924902a9692c5f3aacea7f 7/13/2018 None ...
000dccb863b913226bca8ca636c9ddce 11/5/2017 11/10/2017 ...
I am trying to end up with a column at index 0 which, for each row, shows the average of the differences between consecutive date values in that row (i.e. the difference between the dates in columns 2 and 3, then between columns 3 and 4, then 4 and 5, etc., and then the average of all of these).
Please note, there can be up to 282 date values in a row, but as you can see many have fewer.
Cheers
from datetime import datetime as dt

# df is your dataframe; df2 is a new one you have to initialize as empty first
def diffdate(row, col1, col2):
    # col1, col2 are column positions within the row; missing dates give None back
    d1, d2 = row.iloc[col1], row.iloc[col2]
    if d1 is None or d2 is None:
        return None
    date1 = [int(i) for i in d1.split('/')]
    date2 = [int(i) for i in d2.split('/')]
    return (dt(date2[2], date2[0], date2[1]) - dt(date1[2], date1[0], date1[1])).days

for i in range(len(df.columns) - 1):
    df2[i] = df.apply(lambda x: diffdate(x, i, i + 1), axis=1)
df2 will hold all of the consecutive pair differences. Averaging the rows after this is pretty simple.
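For completeness, a vectorized sketch of the same idea (an assumption that the date columns are strings formatted like the sample and that missing entries are None/NaN):
import pandas as pd

# Sketch: parse every column to datetime, then average the row-wise consecutive gaps.
dates = df.apply(pd.to_datetime, format='%m/%d/%Y', errors='coerce')
# turn each date into a day number so row-wise differences are plain floats
day_numbers = dates.apply(lambda col: (col - pd.Timestamp('1970-01-01')).dt.days)
# consecutive gaps per row, averaged; missing values simply drop out of the mean
df.insert(0, 'avg_gap_days', day_numbers.diff(axis=1).mean(axis=1))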

Aggregating unbalanced panel to time series using pandas

I have an unbalanced panel that I'm trying to aggregate up to a regular, weekly time series. The panel looks as follows:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
To give a better sense of what I'm looking for, I'm including an intermediate step, which I'd love to skip if possible. Basically some data needs to be filled in so that it can be aggregated. As you can see, missing weeks in between observations are interpolated. All other values are set equal to zero.
Group Date value
A 1/1/2000 5
A 1/8/2000 5
A 1/15/2000 10
A 1/22/2000 0
B 1/1/2000 0
B 1/8/2000 3
B 1/15/2000 3
B 1/22/2000 7
C 1/1/2000 0
C 1/8/2000 0
C 1/15/2000 0
C 1/22/2000 20
The final result that I'm looking for is as follows:
Date value
1/1/2000 5 = 5 + 0 + 0
1/8/2000 8 = 5 + 3 + 0
1/15/2000 13 = 10 + 3 + 0
1/22/2000 27 = 0 + 7 + 20
I haven't gotten very far; I managed to create a panel:
panel = df.set_index(['Group','week']).to_panel()
Unfortunately, if I try to resample, I get an error
panel.resample('W')
TypeError: Only valid with DatetimeIndex or PeriodIndex
Assuming df is your second dataframe with weeks, you can try the following:
df.groupby('week').sum()['value']
The documentation of groupby() and its application is here. It's similar to the GROUP BY clause in SQL.
To obtain the second dataframe from the first one, try the following:
First, prepare a function to map the day to a week:
def d2w_map(day):
    if day <= 7:
        return 1
    elif day <= 14:
        return 2
    elif day <= 21:
        return 3
    else:
        return 4
In the method above, days from 29 to 31 are considered in week 4. But you get the idea. You can modify it as needed.
Second, take the 'Day' column from the first dataframe and convert days to weeks:
df['Week'] = df['Day'].apply(d2w_map)
del df['Day']
Third, initialize your second dataframe with only the 'Group' and 'Week' columns, leaving 'value' out. Assuming your initialized new dataframe is result, you can now do a join:
result = result.join(df, on=['Group', 'Week'])
Last, write a function to fill the NaNs in the 'value' column with the nearby element. The NaNs are what you need to interpolate. Since I am not sure how you want the interpolation to work, I will leave it to you.
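For instance, a minimal sketch of that filling step (this is an assumption about what 'nearby' should mean: each group's last observed value is carried forward, and anything before the first observation becomes 0):
# Sketch: forward-fill each group's value over missing weeks, zero elsewhere
result = result.sort_values(['Group', 'Week'])
result['value'] = result.groupby('Group')['value'].ffill().fillna(0)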
Here is how you can change d2w_map to convert a date string into an integer day of the week:
from datetime import datetime
def d2w_map(day_str):
    return datetime.strptime(day_str, '%m/%d/%Y').weekday()
Returned value of 0 means Monday, 1 means Tuesday and so on.
If you have the package dateutil installed, the function can be more robust:
from dateutil.parser import parse
def d2w_map(day_str):
    return parse(day_str).weekday()
Sometimes, things you want are already implemented by magic :)
Turns out the key is to resample a groupby object like so:
df_temp = (df.set_index('date')
             .groupby('Group')
             .resample('W', how='sum', fill_method='ffill'))
ts = (df_temp.reset_index()
             .groupby('date')
             .sum()['value'])
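Note that the how= and fill_method= keywords used here come from an older pandas API and have since been removed. On current versions, the same shape of pipeline looks roughly like this (a sketch; with plain .sum() the empty weeks come out as 0 rather than forward-filled, so any filling would still need a separate step):
import pandas as pd

# Rough modern-API sketch of the pipeline above, using the question's column names
df['Date'] = pd.to_datetime(df['Date'])
weekly = (df.set_index('Date')
            .groupby('Group')['value']
            .resample('W')
            .sum())                        # MultiIndex (Group, week) -> weekly totals
ts = weekly.groupby(level='Date').sum()    # collapse the groups into one weekly series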
Used this tab delimited test.txt:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
You can skip the intermediate dataframe as follows. I don't have time now, so just play around with it to get it right.
import pandas as pd
import datetime
time_format = '%m/%d/%Y'
Y = pd.read_csv('test.txt', sep="\t")
dates = Y['Date']
dates_right_format = map(lambda s: datetime.datetime.strptime(s, time_format), dates)
values = Y['value']
X = pd.DataFrame(values)
X.index = dates_right_format
print X
X = X.sort()
print X
print X.resample('W', how=sum, closed='right', label='right')
Last print
value
2000-01-02 5
2000-01-09 3
2000-01-16 NaN
2000-01-23 37
