I'm creating columns that hold past data from the same columns of the database. For each day, I need the Y value of the previous day and of the same weekday in the previous week. So:
x = df.copy()
x["Date"] = pd.to_datetime(df.rename(columns={"Año":"Year","Mes":"Month","Dia":"Day"})[["Year","Month","Day"]])
y = x[["Date","Y"]]
y.rename(columns={"Y":"Y_DiaAnterior"}, inplace=True)
y["Date"] = y["Date"] + dt.timedelta(days=1)
z = pd.merge(x,y,on=["Date"], how="left")
display(y.head()) # First merge result
a = x[["Date","Y"]]
a.rename(columns={"Y":"Y_DiaSemAnterior"}, inplace=True)
a["Date"] = a["Date"] + dt.timedelta(days=7)
z = pd.merge(x,a,on=["Date"], how="left")
z.head() # Second merge result
Here y is an auxiliary df used to create the column with the previous day's Y, and a is an auxiliary df used to create the column with the Y of the same weekday in the previous week.
When I merge them separately it works perfectly, but when I try to merge all of them (first x with y and then x with a) the merge of x with y is 'deleted': the Y_DiaAnterior column is not in the final df (the 'second merge result'), even though I already merged it.
(screenshots of the first and second merge results omitted)
So, how can I make the final df contain both the Y_DiaAnterior and Y_DiaSemAnterior variables?
That happens because you're overwriting z with the new merge of x and a. Also, your code isn't showing the result of the first merge, because you're displaying y.head() instead of z.head().
If you want the merge results of all 3 df's, you can chain the merges:
# prep x
x = df.copy()
x["Date"] = pd.to_datetime(df.rename(columns={"Año":"Year", "Mes":"Month", "Dia":"Day"})[["Year", "Month", "Day"]])
# prep y
y = x[["Date", "Y"]].copy()
y.rename(columns={"Y":"Y_DiaAnterior"}, inplace=True)
y["Date"] = y["Date"] + dt.timedelta(days=1)
# prep a
a = x[["Date", "Y"]].copy()
a.rename(columns={"Y":"Y_DiaSemAnterior"}, inplace=True)
a["Date"] = a["Date"] + dt.timedelta(days=7)
# now merge all
z = x.merge(y, on='Date', how='left') \
     .merge(a, on='Date', how='left')
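As a side note, the same result can be written without keeping the helper frames y and a around, by building them inline with assign/rename; a sketch assuming the same columns as above:

z = (x.merge(x[["Date", "Y"]]
               .assign(Date=lambda d: d["Date"] + dt.timedelta(days=1))
               .rename(columns={"Y": "Y_DiaAnterior"}),
             on="Date", how="left")
      .merge(x[["Date", "Y"]]
               .assign(Date=lambda d: d["Date"] + dt.timedelta(days=7))
               .rename(columns={"Y": "Y_DiaSemAnterior"}),
             on="Date", how="left"))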
Related
I am trying to rearrange a DataFrame. Currently, I have 1035 rows and 24 columns, one for each hour of the day. I want to reshape this into an array with 1035*24 rows. If you want to see the data, it can be extracted from the following JSON endpoint:
url = "https://www.svk.se/services/controlroom/v2/situation?date={}&biddingArea=SE1"
svk = []
for i in parsing_range_svk:
data_json_svk = json.loads(urlopen(url.format(i)).read())
svk.append([v["y"] for v in data_json_svk["Data"][0]["data"]])
This is the code I am using to rearrange the data, but it is not doing the job. The first observation is in the right place, but then it starts getting messy, and I have not been able to figure out where each observation goes.
import pandas as pd
from datetime import datetime, timedelta

svk = pd.DataFrame(svk)
date_start1 = datetime(2020, 1, 1)
date_range1 = [date_start1 + timedelta(days=x) for x in range(1035)]
date_svk = pd.DataFrame(date_range1, columns=['date'])
svk['date'] = date_svk['date']
svk.drop(24, axis=1, inplace=True)

consumption_svk_1 = (svk.melt('date', value_name='SE1_C')
                        .assign(date=lambda x: x['date'] +
                                pd.to_timedelta(x.pop('variable').astype(float), unit='h'))
                        .sort_values('date', ignore_index=True))
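For what it's worth, here is a toy illustration (a made-up 2-day, 3-hour frame, not the SVK data) of what the melt + to_timedelta step is supposed to produce; it may help in spotting where the real data goes wrong:

import pandas as pd
from datetime import datetime

toy = pd.DataFrame([[10, 11, 12], [20, 21, 22]])            # 2 days x 3 "hours" (columns 0..2)
toy['date'] = [datetime(2020, 1, 1), datetime(2020, 1, 2)]

out = (toy.melt('date', value_name='SE1_C')
          .assign(date=lambda x: x['date'] +
                  pd.to_timedelta(x.pop('variable').astype(float), unit='h'))
          .sort_values('date', ignore_index=True))
# out has 6 rows: 2020-01-01 00:00/01:00/02:00 -> 10/11/12, then the same for 2020-01-02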
Steps to the goal:
create a for loop to go through every row in the dataframe and:
take the X and Y column values to use them in the function
the function will generate longitude and latitude values
add those values to the same row in new columns called "Lat" and "Lon"
At the moment, steps 1 and 2 are working, but I can't get each value appended into the new columns.
What I have tried is:
Definition to use in the loop
import pyproj

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=28, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    print(lonlat[1], lonlat[0])
Loop:
for _, row in df.iterrows():
    xy_to_lonlat(row['X'], row['Y'])
This is the output, which is perfect:
28.667978631874004 -17.96430510323817
28.67957708337043 -17.96589718293177
28.680075373251725 -17.96652237896143
28.696094952446764 -17.971279315586795
But I need to put these 2 values into df, specifically into df['Lat'] and df['Lon'].
What I have tried is to append() them to lists that I will later insert into df, but it doesn't work:
aLongitud = []
aLatitud = []
for _, row in df.iterrows():
    xy_to_lonlat(row['X'], row['Y'])
    aLongitud.append(lonlat[1])
    aLatitud.append(lonlat[0])
This is what df looks like:
The function works for all 52 rows; I just need to get the results into 2 new columns in the df:
28.667978631874004 -17.96430510323817
28.67957708337043 -17.96589718293177
28.680075373251725 -17.96652237896143
28.696094952446764 -17.971279315586795
28.69709953128404 -17.97089438970623
28.704102246479206 -17.97502030269029
28.714190480593878 -17.98059681820521
28.84284299081375 -17.943724718418043
28.85522495646711 -17.907748758676934
28.85497605095961 -17.915999785074945
28.834039353212727 -17.853402778875363
28.84368320877517 -17.790724992980966
28.8311955800612 -17.773218425619255
28.757725903465193 -17.735394629644425
28.75694932761218 -17.734865031953948
28.651232614536056 -17.75864104734293
28.647850336922037 -17.75586691138396
28.64510111053916 -17.756973867003158
28.54740295444906 -17.779646961686794
28.481011316595747 -17.871383348460515
28.598084805574075 -17.92779850800547
28.84869842152646 -17.898800401690675
28.730123181880874 -17.72687292142767
28.65501749037169 -17.759807688028065
28.586115587686052 -17.755714748146353
28.855549587948108 -17.90757529900783
28.62104314133748 -17.750679106650242
28.805231369924527 -17.76049570914483
28.842322764567797 -17.794590436117428
28.654662237239517 -17.761368473029265
28.652716177555675 -17.954686156568993
28.84441637529699 -17.789637146820752
28.812367721581616 -17.763087214328706
28.80648375432461 -17.75977264125206
28.713070037952928 -17.74394044409638
28.850159557661478 -17.898032389327415
28.84268417328949 -17.884610248902643
28.506075965709968 -17.87932721318885
28.60916367244466 -17.92715257476472
28.508055636889907 -17.879126662123344
28.593688218530882 -17.755496249789623
28.614870490264675 -17.753636080872226
28.453393338804933 -17.83975500058191
28.81927942283548 -17.97071265399719
28.632049774803967 -17.948276230580895
28.810197401802437 -17.7626526992656
28.81013751332894 -17.762176710792335
28.651195000175182 -17.757862000230173
28.491243000164914 -17.874658000300624
28.523693000166094 -17.87819700030406
28.56082500016691 -17.89452700031706
28.53126600016634 -17.878297000304332
This is how the df looks after looping the function (the "None" issue):
This solution uses your existing xy_to_lonlat() function (changed to return the values instead of printing them) with the pandas DataFrame apply method:
import pandas as pd
import pyproj

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=28, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[1], lonlat[0]

# I just made up this data
xs = [21000, 21020, 23000]
ys = [3000000, 3000050, 3000100]
df = pd.DataFrame({'X': xs, 'Y': ys})

df['lat_lon'] = df.apply(lambda r: xy_to_lonlat(r['X'], r['Y']), axis=1)
df['Lat'] = df['lat_lon'].apply(lambda x: x[0])
df['Lon'] = df['lat_lon'].apply(lambda x: x[1])
df = df.drop('lat_lon', axis=1)
df
#        X        Y        Lat        Lon
# 0  21000  3000000  27.039540 -19.826207
# 1  21020  3000050  27.039996 -19.826026
# 2  23000  3000100  27.041129 -19.806152
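As a side note (not part of the original answer), pyproj.transform is deprecated in pyproj 2+; a rough sketch of the same conversion with the newer Transformer API, assuming UTM zone 28N / WGS84 as in the function above:

import pandas as pd
from pyproj import Transformer

# build the transformer once: UTM zone 28N -> WGS84 lon/lat
# always_xy=True keeps the (easting, northing) -> (lon, lat) argument order
transformer = Transformer.from_crs("+proj=utm +zone=28 +datum=WGS84",
                                   "+proj=latlong +datum=WGS84",
                                   always_xy=True)

# made-up sample data, same shape as in the answer above
df = pd.DataFrame({'X': [21000, 21020, 23000], 'Y': [3000000, 3000050, 3000100]})

# Transformer.transform is vectorised, so whole columns can be converted at once
lon, lat = transformer.transform(df['X'].to_numpy(), df['Y'].to_numpy())
df['Lat'] = lat
df['Lon'] = lon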
I have two dataframes both of which have the same basic schema. (4 date fields, a couple of string fields, and 4-5 float fields). Call them df1 and df2.
What I want to do is basically get a "diff" of the two - where I get back all rows that are not shared between the two dataframes (not in the set intersection). Note, the two dataframes need not be the same length.
I tried using pandas.merge(how='outer'), but I was not sure what column to pass in as the 'key', as there really isn't one, and the various combinations I tried were not working. It is possible that df1 or df2 has two (or more) rows that are identical.
What is a good way to do this in pandas/Python?
Try this:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
diff_df = diff_df.loc[diff_df['Exist'] != 'both']
You will have a dataframe of all rows that don't exist in both df1 and df2.
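For reference, a minimal illustration with made-up data of what the indicator column looks like:

import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'value': [1, 2, 4]})

diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
#   name  value      Exist
# 0    a      1       both
# 1    b      2       both
# 2    c      3  left_only
# 3    d      4  right_only

diff_df = diff_df.loc[diff_df['Exist'] != 'both']   # keeps only the c and d rows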
IIUC:
You can use pd.Index.symmetric_difference
pd.concat([df1, df2]).loc[
    df1.index.symmetric_difference(df2.index)
]
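Note that this works on the index labels rather than the row contents, so it assumes the index identifies each record; a small made-up sketch:

import pandas as pd

df1 = pd.DataFrame({'val': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'val': [1, 2, 4]}, index=[0, 1, 3])

only = df1.index.symmetric_difference(df2.index)   # labels present in exactly one frame: [2, 3]
pd.concat([df1, df2]).loc[only]
#    val
# 2    3
# 3    4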
You can use the function below; the output is an ordered dict of 6 dataframes which you can write to Excel for further analysis.
'df1' and 'df2' refer to your input dataframes.
'uid' refers to the column or combination of columns that make up the unique key (e.g. 'Fruits').
'dedupe' (default=True) drops duplicates in df1 and df2 (refer to Step 4 in the comments).
'labels' (default=('df1','df2')) allows you to name the input dataframes. If a unique key exists in both dataframes but has different values in one or more columns, it is usually important to see those rows; they are stacked one on top of the other and labelled with the dataframe name so you know which dataframe each row belongs to.
'drop' can take a list of columns to be excluded from consideration when computing the difference.
Here goes:
df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut',3]], columns=['Fruits','Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian',4]], columns=['Fruits','Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')
In [10]: dict1['df1_only']
Out[10]:
    Fruits  Quantity
1  coconut         3

In [11]: dict1['df2_only']
Out[11]:
   Fruits  Quantity
3  durian         4

In [12]: dict1['Diff']
Out[12]:
   Fruits  Quantity df1 or df2
0  banana         2        df1
1  banana         3        df2

In [13]: dict1['Merge']
Out[13]:
  Fruits  Quantity
0  apple         1
Here is the code:
import pandas as pd
from collections import OrderedDict as od

def diff_func(df1, df2, uid, dedupe=True, labels=('df1', 'df2'), drop=[]):
    dict_df = {labels[0]: df1, labels[1]: df2}
    col1 = df1.columns.values.tolist()
    col2 = df2.columns.values.tolist()

    # There could be columns known to be different, hence allow user to pass this as a list to be dropped.
    if drop:
        print('Ignoring columns {} in comparison.'.format(', '.join(drop)))
        col1 = list(filter(lambda x: x not in drop, col1))
        col2 = list(filter(lambda x: x not in drop, col2))
        df1 = df1[col1]
        df2 = df2[col2]

    # Step 1 - Check if no. of columns are the same:
    len_lr = len(col1), len(col2)
    assert len_lr[0] == len_lr[1], \
        'Cannot compare frames with different number of columns: {}.'.format(len_lr)

    # Step 2a - Check if the set of column headers are the same
    #           (order doesnt matter)
    assert set(col1) == set(col2), \
        'Left column headers are different from right column headers.' \
        + '\n Left orphans: {}'.format(list(set(col1) - set(col2))) \
        + '\n Right orphans: {}'.format(list(set(col2) - set(col1)))

    # Step 2b - Check if the column headers are in the same order
    if col1 != col2:
        print('[Note] Reordering right Dataframe...')
        df2 = df2[col1]

    # Step 3 - Check datatype are the same [Order is important]
    if set((df1.dtypes == df2.dtypes).tolist()) - {True}:
        print('dtypes are not the same.')
        df_dtypes = pd.DataFrame({labels[0]: df1.dtypes, labels[1]: df2.dtypes, 'Diff': (df1.dtypes == df2.dtypes)})
        df_dtypes = df_dtypes[df_dtypes['Diff'] == False][[labels[0], labels[1], 'Diff']]
        print(df_dtypes)
    else:
        print('DataType check: Passed')

    # Step 4 - Check for duplicate rows
    if dedupe:
        for key, df in dict_df.items():
            if df.shape[0] != df.drop_duplicates().shape[0]:
                print(key + ': Duplicates exists, they will be dropped.')
                dict_df[key] = df.drop_duplicates()

    # Step 5 - Check for duplicate uids.
    if type(uid) == str or type(uid) == list:
        print('Uniqueness check: {}'.format(uid))
        for key, df in dict_df.items():
            count_uid = df.shape[0]
            count_uid_unique = df[uid].drop_duplicates().shape[0]
            var = [0, 1][count_uid_unique == df.shape[0]]  # <-- Round off to the nearest integer if it is 100%
            pct = round(100 * count_uid_unique / df.shape[0], var)
            print('{}: {} out of {} are unique ({}%).'.format(key, count_uid_unique, count_uid, pct))

    # Checks complete, begin merge. '''Remember to dedupe, provide labels for common_no_match'''
    dict_result = od()
    df_merge = pd.merge(df1, df2, on=col1, how='inner')
    if not df_merge.shape[0]:
        print('Error: Merged DataFrame is empty.')
    else:
        dict_result[labels[0]] = df1
        dict_result[labels[1]] = df2
        dict_result['Merge'] = df_merge
        if type(uid) == str:
            uid = [uid]
        if type(uid) == list:
            df1_only = df1.append(df_merge).reset_index(drop=True)
            df1_only['Duplicated'] = df1_only.duplicated(keep=False)  # keep=False marks all duplicates as True
            df1_only = df1_only[df1_only['Duplicated'] == False]
            df2_only = df2.append(df_merge).reset_index(drop=True)
            df2_only['Duplicated'] = df2_only.duplicated(keep=False)
            df2_only = df2_only[df2_only['Duplicated'] == False]

            label = labels[0] + ' or ' + labels[1]
            df_lc = df1_only.copy()
            df_lc[label] = labels[0]
            df_rc = df2_only.copy()
            df_rc[label] = labels[1]
            df_c = df_lc.append(df_rc).reset_index(drop=True)
            df_c['Duplicated'] = df_c.duplicated(subset=uid, keep=False)
            df_c1 = df_c[df_c['Duplicated'] == True]
            df_c1 = df_c1.drop('Duplicated', axis=1)
            df_uc = df_c[df_c['Duplicated'] == False]

            df_uc_left = df_uc[df_uc[label] == labels[0]]
            df_uc_right = df_uc[df_uc[label] == labels[1]]
            dict_result[labels[0] + '_only'] = df_uc_left.drop(['Duplicated', label], axis=1)
            dict_result[labels[1] + '_only'] = df_uc_right.drop(['Duplicated', label], axis=1)
            dict_result['Diff'] = df_c1.sort_values(uid).reset_index(drop=True)

    return dict_result
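One caveat for newer pandas: DataFrame.append was removed in pandas 2.0, so the df1.append(df_merge) style calls above would need to become pd.concat calls, for example:

# pandas >= 2.0 replacement for df1.append(df_merge)
df1_only = pd.concat([df1, df_merge]).reset_index(drop=True)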
Set df2.columns = df1.columns
Now, set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2.
You can now do df1.index.difference(df2.index) and df2.index.difference(df1.index), and the two results are the rows unique to each dataframe.
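A small sketch of that approach with made-up data (column names assumed):

import pandas as pd

df1 = pd.DataFrame({'fruit': ['apple', 'banana'], 'qty': [1, 2]})
df2 = pd.DataFrame({'fruit': ['apple', 'cherry'], 'qty': [1, 3]})

df2.columns = df1.columns
a = df1.set_index(df1.columns.tolist())
b = df2.set_index(df2.columns.tolist())

a.index.difference(b.index)   # rows only in df1: ('banana', 2)
b.index.difference(a.index)   # rows only in df2: ('cherry', 3)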
With
left_df.merge(df, left_on=left_df.columns.tolist(), right_on=df.columns.tolist(), how='outer')
you can get the outer join result. Similarly, you can get the inner join result. Then take the difference between the two, which is what you want.
I have a pandas dataframe with two columns: the first holds a single date ('action_date') and the second holds a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the corresponding 'verification_date' list, and then fill two new df columns with the number of dates in verification_date whose difference is over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
# pd.TimeGrouper is deprecated; pd.Grouper(freq='2D') is the equivalent in newer pandas
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()

def make_columns(df):
    df = df
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
This kind of works, EXCEPT that the df ends up with the same values in every row, which is wrong because the dates are different. For example, in the first row of the dataframe there IS a difference of over 360 days between the action_date and both of the items in the verification_date list, so the over_360 column should be populated with 2. However, it is empty, and instead the under_360 column is populated with 1, which is accurate only for the second row of 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation individually; you can do this by replacing the lines above with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
What this does is set the value at row i in the over_360 or under_360 column.
You can learn more about it here.
If you don't like using set_value, you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
you can check dataframe.ix here.
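Note that both set_value and .ix have since been removed from pandas; in current versions the equivalent single-cell assignment, as a drop-in replacement for the two lines inside the loop above, would be .at (or .loc):

# modern pandas equivalent of the set_value / .ix lines above
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)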
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
I am working with stock data, and I want my data sets to have equal lengths when performing certain types of analysis.
Problem
If I load data for Apple I will get daily data since 1985, but if I load data for a Natural Gas ETF it might only go back to 2012. I now want to filter Apple to only show history going back to 2012. The same applies to the end date: some of my datasets may not be up to date, as the Apple data ranges from 1985 to 1-20-17 while the Natural Gas ETF data ranges from 2012 to 12-23-16. So I also want another filter that sets the max date. My Apple data set is then filtered for dates ranging between 2012 and 12-23-16, and my datasets are equal.
Approach
I have a dictionary called Stocks which stores all of my dataframes. All the dataframes have a column named D, which is the Date column.
I wrote a function that populates a dictionary with the dataframes and also records the min and max dates for each df. I store all those min/max dates in two other dictionaries, DatesMax and DatesMin, then take the min of the max dates and the max of the min dates to get the values that will be used as the filter bounds on all the dataframes.
The function below works; it gets the min and max dates of multiple dataframes and returns them in a dictionary named DatesMinMax.
def MinMaxDates(FileName):
    DatesMax = {}; DatesMin = {}
    DatesMinMax = {}; stocks = {}
    with open(FileName) as file_object:
        Current_indicators = file_object.read()
        tickers = Current_indicators.split('\n')
    for i in tickers:
        a = '/' in i
        if a == True:
            x = i.find("/") + 1
            df = pd.read_csv(str(i[x:]) + '_data.csv')
            stocks[i] = df
            maxDate = max(df.D)
            minDate = min(df.D)
            DatesMax[i] = maxDate
            DatesMin[i] = minDate
        else:
            df = pd.read_csv(i + '_data.csv')
            stocks[i] = df
            maxDate = max(df.D)
            minDate = min(df.D)
            DatesMax[i] = maxDate
            DatesMin[i] = minDate
    x = min(DatesMax.values())
    y = max(DatesMin.values())
    DatesMinMax = {'MaxDate': x, 'MinDate': y}
    return DatesMinMax
print DatesMinMax
# {'MinDate': '2012-02-08', 'MaxDate': '2017-01-20'}
Question
Now I will have to loop over all the dataframes in the dict named Stocks to filter their date columns. It seems inefficient to loop over everything again, but I can't think of any other way to apply the filter.
Actually, you may not need to capture min and max (since 2016-12-30 < 2017-01-20) for later filtering at all; simply run a full inner join merge across all dataframes on the 'D' (Date) column.
Consider doing so with a chained merge, which ensures equal lengths across all dataframes, and then slice the resulting master dataframe by ticker columns to build the Stocks dictionary. Of course, you can also use the wide master dataframe directly for analysis:
from functools import reduce
import pandas as pd

with open(FileName) as file_object:
    Current_indicators = file_object.read()
    tickers = Current_indicators.split('\n')

# DATA FRAME LIST BUILD
dfs = []
for i in tickers:
    if '/' in i:
        x = i.find("/") + 1
        df = pd.read_csv(str(i[x:]) + '_data.csv')
        # PREFIX ALL NON-DATE COLS WITH TICKER PREFIX (keep 'D' unprefixed as the join key)
        df.columns = [col if col == 'D' else i + '_' + str(col) for col in df.columns]
        dfs.append(df)
    else:
        df = pd.read_csv(i + '_data.csv')
        # PREFIX ALL NON-DATE COLS WITH TICKER PREFIX (keep 'D' unprefixed as the join key)
        df.columns = [col if col == 'D' else i + '_' + str(col) for col in df.columns]
        dfs.append(df)

# CHAIN MERGE (INNER JOIN) ACROSS ALL DFS
masterdf = reduce(lambda left, right: pd.merge(left, right, on=['D']), dfs)

# DATA FRAME DICT BUILD
stocks = {}
for i in tickers:
    # SLICE CURRENT TICKER COLUMNS
    df = masterdf[['D'] + [col for col in masterdf.columns if col.startswith(i + '_')]]
    # REMOVE TICKER PREFIXES
    df.columns = [col.replace(i + '_', '') for col in df.columns]
    stocks[i] = df
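A quick sanity check (sketch) that the inner join left every ticker's frame with the same number of rows:

# every sliced frame should now cover exactly the dates shared by all tickers
lengths = {ticker: len(frame) for ticker, frame in stocks.items()}
print(lengths)
assert len(set(lengths.values())) == 1, "frames are not equally sized"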