How to combine/merge multiple DataFrameGroupBy in pandas - python

I have two <pandas.core.groupby.DataFrameGroupBy> objects and would like to combine them by a key. How would I do that? Using 'as_index=False' does not work (it used to work before). I tried the following:
result = pd.merge(groupobject_a, groupobject_b, on='important_key', how='inner')
but I am getting the error below:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.groupby.DataFrameGroupBy'>
Here is the minimal code showing how I created my groupby objects:
import pandas as pd
my_dataframe = pd.read_csv("here is my csv")
groupobject_a= my_dataframe[(my_dataframe['colA'] > 0) & (my_dataframe['colB'] < 15) & (my_dataframe['colC'].notnull())].groupby(['important_key'], as_index=False)
groupobject_b= my_dataframe[(my_dataframe['colA'] > 25) & (my_dataframe['colB'] < 65) & (my_dataframe['colC'].notnull())].groupby(['important_key'], as_index=False)
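For what it's worth, merge operates on DataFrames, not on the lazy groupby objects, so one fix is to run an aggregation on each group first and merge the aggregated results. A minimal sketch with made-up data (the column values and the sum aggregation are assumptions, not from the original CSV):

```python
import pandas as pd

# stand-in for the CSV data
my_dataframe = pd.DataFrame({
    'important_key': ['a', 'a', 'b', 'b'],
    'colA': [1, 2, 30, 40],
})

# materialize each groupby into a DataFrame via an aggregation ...
agg_a = my_dataframe[my_dataframe['colA'] > 0].groupby('important_key', as_index=False)['colA'].sum()
agg_b = my_dataframe[my_dataframe['colA'] > 25].groupby('important_key', as_index=False)['colA'].sum()

# ... then the merge succeeds, because both operands are DataFrames
result = pd.merge(agg_a, agg_b, on='important_key', how='inner', suffixes=('_a', '_b'))
```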

Related

Save and load correctly pandas dataframe in csv while preserving freq of datetimeindex

I was trying to save a DataFrame and load it back. If I print the resulting df, I see they are (almost) identical. The freq attribute of the DatetimeIndex is not preserved, though.
My code looks like this:
import datetime
import os
import numpy as np
import pandas as pd
def test_load_pandas_dataframe():
    idx = pd.date_range(start=datetime.datetime.now(),
                        end=(datetime.datetime.now()
                             + datetime.timedelta(hours=3)),
                        freq='10min')
    a = pd.DataFrame(np.arange(2 * len(idx)).reshape((len(idx), 2)), index=idx,
                     columns=['first', 2])
    a.to_csv('test_df')
    b = load_pandas_dataframe('test_df')
    os.remove('test_df')
    assert np.all(b == a)

def load_pandas_dataframe(filename):
    '''Correctly loads dataframe but freq is not maintained'''
    df = pd.read_csv(filename, index_col=0,
                     parse_dates=True)
    return df

if __name__ == '__main__':
    test_load_pandas_dataframe()
And I get the following error:
ValueError: Can only compare identically-labeled DataFrame objects
It is not a big issue for my program, but it is still annoying.
Thanks!
The issue here is that the dataframe you save has columns
Index(['first', 2], dtype='object')
but the dataframe you load has columns
Index(['first', '2'], dtype='object').
In other words, the columns of your original dataframe had the integer 2, but upon saving it with to_csv and loading it back with read_csv, it is parsed as the string '2'.
The easiest fix that passes your assertion is to change the columns argument to:
columns=['first', '2'])
To complement #jfaccioni's answer: the freq attribute is not preserved, and there are two options here.
Fast and simple: use pickle, which will preserve everything:
a.to_pickle('test_df')
b = pd.read_pickle('test_df')
a.equals(b) # True
Or you can use the inferred_freq attribute of a DatetimeIndex:
a.to_csv('test_df')
b = pd.read_csv('test_df', index_col=0, parse_dates=True)
b.index.freq = b.index.inferred_freq
print(b.index.freq)  # <10 * Minutes>
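Put together as a self-contained round trip (using an in-memory buffer instead of a file; the index values here are made up):

```python
import io

import numpy as np
import pandas as pd

idx = pd.date_range('2021-01-01', periods=6, freq='10min')
a = pd.DataFrame({'x': np.arange(6)}, index=idx)

# write to CSV and read back, exactly as with a file on disk
buf = io.StringIO()
a.to_csv(buf)
buf.seek(0)
b = pd.read_csv(buf, index_col=0, parse_dates=True)

# freq is lost in the CSV round trip; re-infer and reattach it
b.index.freq = b.index.inferred_freq
```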

Python Series - if it is between the values of two other series (time)

I have a dataframe with datetimes (df1). I want to know whether the datetime in 'col1' of df1 falls between any pair of datetimes given by the 'lowerbound' and 'upperbound' columns of another dataframe (df2).
For example:
df1 = pd.to_datetime(['2014-04-09 07:37:00', '2015-04-09 07:00:00',
                      '2014-02-02 08:31:00', '2014-03-02 08:22:00'])
df1 = pd.DataFrame(df1, columns=['col1'])
lowerbound = pd.to_datetime(['2014-04-09 07:25:00', '2014-02-02 08:30:00',
                             '2015-04-09 06:00:00', '2014-03-02 08:12:00'])
upperbound = pd.to_datetime(['2014-04-09 07:38:00', '2014-04-09 07:48:00',
                             '2015-04-09 08:00:00', '2014-02-02 08:33:00'])
df2 = pd.DataFrame(lowerbound, columns=['lowerbound'])
df2['upperbound'] = upperbound
The result shall be [1, 1, 0, 0] since:
df1['col1'][0] is between df2['lowerbound'][0] & df2['upperbound'][0]
df1['col1'][1] is between df2['lowerbound'][2] & df2['upperbound'][2]
Although df1['col1'][2] is between df2['lowerbound'][1] & df2['upperbound'][3], the indices for df2['lowerbound'] and df2['upperbound'] are not the same.
Thanks!
You can use np.greater_equal and np.less_equal with outer, then any along axis=1, such as:
import numpy as np

print((np.greater_equal.outer(df1.col1, df2.lowerbound)
       & np.less_equal.outer(df1.col1, df2.upperbound))
      .any(1).astype(int))
With your data this gives [1 1 1 1].
I believe you need apply in this case:
df1.col1.apply(lambda dat: ((dat>= df2.lowerbound) & (dat <= df2.upperbound)).any())
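As a self-contained check of the outer approach above on the sample data (the missing bracket in the question's upperbound line is fixed here):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': pd.to_datetime(
    ['2014-04-09 07:37:00', '2015-04-09 07:00:00',
     '2014-02-02 08:31:00', '2014-03-02 08:22:00'])})
df2 = pd.DataFrame({
    'lowerbound': pd.to_datetime(['2014-04-09 07:25:00', '2014-02-02 08:30:00',
                                  '2015-04-09 06:00:00', '2014-03-02 08:12:00']),
    'upperbound': pd.to_datetime(['2014-04-09 07:38:00', '2014-04-09 07:48:00',
                                  '2015-04-09 08:00:00', '2014-02-02 08:33:00']),
})

# row i of the outer comparison asks: does df1.col1[i] fall inside ANY
# (lowerbound[j], upperbound[j]) pair?  .any(axis=1) collapses over j
inside = (np.greater_equal.outer(df1['col1'].values, df2['lowerbound'].values)
          & np.less_equal.outer(df1['col1'].values, df2['upperbound'].values)
          ).any(axis=1).astype(int)
```

Note that this answers "inside any pair", which is why it returns 1 for every row of the sample data rather than the strictly same-index [1, 1, 0, 0] the question asks about.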

numpy under a groupby not working

I have the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('C:/test.csv')
df.drop(['SecurityID'], axis=1, inplace=True)
Time = 1
trade_filter_size = 9
groupbytime = (str(Time) + "min")
df['dateTime_s'] = df['dateTime'].astype('datetime64[s]')
df['dateTime'] = pd.to_datetime(df['dateTime'])
df[str(Time)+"min"] = df['dateTime'].dt.floor(str(Time)+"min")
df['tradeBid'] = np.where(((df['tradePrice'] <= df['bid1']) & (df['isTrade']==1)), df['tradeVolume'], 0)
groups = df[df['isTrade'] == 1].groupby(groupbytime)
print("groups",groups.dtypes)
#THIS IS WORKING
df_grouped = (groups.agg({
    'tradeBid': [('sum', np.sum), ('downticks_number', lambda x: (x > 0).sum())],
}))
# creating a new data frame which is filtered
df2 = pd.DataFrame(df.loc[(df['isTrade'] == 1) & (df['tradeVolume'] >= trade_filter_size)])
# recalculating all the bid/ask volume to be based on the filter size
df2['tradeBid'] = np.where(((df2['tradePrice'] <= df2['bid1']) & (df2['isTrade'] == 1)), df2['tradeVolume'], 0)
df2grouped = (df2.agg({
    # here is the problem!!! NOT WORKING
    'tradeBid': [('sum', np.sum), ('downticks_number', lambda x: (x > 0).sum())],
}))
The same aggregation, 'tradeBid': [('sum', np.sum), ('downticks_number', lambda x: (x > 0).sum())], is used in both places. The first time it works fine, but when it is applied to the filtered data in the new df it raises an error:
ValueError: downticks_number is an unknown string function
When I use this code instead to work around the above:
'tradeBid': [('sum', np.sum), lambda x: (x > 0).sum()],
I get this error:
ValueError: cannot combine transform and aggregation operations
Any idea why I get different results for the same usage of code?
Since there were two conditions to match for the second groupby, I solved this by moving the filter into the df: I created a new column that combines both filters and used it to select rows before grouping. After that, the groupby worked without problems; the order of operations was the problem.
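A hedged sketch of that fix with made-up data (the 'minute' grouping key and the sample values are assumptions): build the combined filter as a boolean column, select on it once, and run the aggregation on a proper groupby:

```python
import pandas as pd

# toy stand-in for the original CSV
df = pd.DataFrame({
    'minute': ['09:30', '09:30', '09:31', '09:31'],
    'isTrade': [1, 1, 1, 0],
    'tradeVolume': [5, 12, 20, 7],
    'tradeBid': [5, 12, 20, 0],
})
trade_filter_size = 9

# move both conditions into one boolean column, filter, then group
df['passes_filter'] = (df['isTrade'] == 1) & (df['tradeVolume'] >= trade_filter_size)
df2grouped = df[df['passes_filter']].groupby('minute').agg(
    sum=('tradeBid', 'sum'),
    downticks_number=('tradeBid', lambda x: (x > 0).sum()),
)
```

Because the aggregation now runs on a groupby object rather than on the bare DataFrame, the named aggregations are accepted.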

Create dictionary of dataframe in pyspark

I am trying to create a dictionary keyed by year and month, a kind of macro that I can call for the required number of years and months. I am facing a challenge while adding dynamic columns to a PySpark df.
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, when, date_add, to_date
from pyspark.sql.types import DateType

df = spark.createDataFrame([(1, "foo1", '2016-1-31'), (1, "test", '2016-1-31'), (2, "bar1", '2012-1-3'), (4, "foo2", '2011-1-11')], ("k", "v", "date"))
w = Window().partitionBy().orderBy(col('date').desc())
df = df.withColumn("next_date",lag('date').over(w).cast(DateType()))
df = df.withColumn("next_name",lag('v').over(w))
df = df.withColumn("next_date",when(col("k") != lag(df.k).over(w),date_add(df.date,605)).otherwise(col('next_date')))
df = df.withColumn("next_name",when(col("k") != lag(df.k).over(w),"").otherwise(col('next_name')))
import copy
dict_of_YearMonth = {}
for yearmonth in [200901,200902,201605 .. etc]:
    key_name = 'Snapshot_' + str(yearmonth)
    dict_of_YearMonth[key_name].withColumn("test", yearmonth)
    dict_of_YearMonth[key_name].withColumn("test_date", to_date('' + yearmonth[:4] + '-' + yearmonth[4:2] + '-1' + ''))
    # Now I want to add a condition:
    # if (dict_of_YearMonth[key_name].test_date >= dict_of_YearMonth[key_name].date) and (test_date <= next_date),
    # then output snapshot_yearmonth, i.e. the dataframe which satisfies this condition.
    # I am able to do it in pandas but am facing a challenge in PySpark.
    dict_of_YearMonth[key_name]
dict_of_YearMonth
Then I want to concatenate all the dataframes into a single PySpark dataframe. I could do this in pandas as shown below, but I need to do it in PySpark:
snapshots=pd.concat([dict_of_YearMonth['Snapshot_201104'],dict_of_YearMonth['Snapshot_201105']])
If there is any other idea for generating a dictionary of dynamic dataframes with dynamic addition of columns, applying conditions, generating year-based dataframes, and merging them into a single dataframe, any help would be appreciated.
I have tried the code below and it is working fine:
import datetime
from functools import reduce
from dateutil.parser import parse
from pyspark.sql import DataFrame
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType, DateType

# Function to append all the dataframes using union
def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

# convert dates
def is_date(x):
    try:
        x = str(x) + str('01')
        parse(x)
        return datetime.datetime.strptime(x, '%Y%m%d').strftime("%Y-%m-%d")
    except ValueError:
        pass  # if incorrect format, keep trying other formats

dict_of_YearMonth = {}
for yearmonth in [200901, 200910]:
    key_name = 'Snapshot_' + str(yearmonth)
    dict_of_YearMonth[key_name] = df
    func = udf(lambda x: yearmonth, StringType())
    dict_of_YearMonth[key_name] = df.withColumn("test", func(col('v')))
    default_date = udf(lambda x: is_date(x))
    dict_of_YearMonth[key_name] = dict_of_YearMonth[key_name].withColumn("test_date", default_date(col('test')).cast(DateType()))
dict_of_YearMonth
To append multiple dataframes, use the code below:
final_df = unionAll(dict_of_YearMonth['Snapshot_200901'], dict_of_YearMonth['Snapshot_200910'])
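The unionAll helper is just functools.reduce folding a two-argument union over the list of frames; the same fold pattern, shown here in plain pandas so it runs without a Spark session, looks like:

```python
from functools import reduce

import pandas as pd

# small stand-ins for the per-snapshot Spark dataframes
dict_of_YearMonth = {
    'Snapshot_200901': pd.DataFrame({'k': [1], 'test': ['200901']}),
    'Snapshot_200910': pd.DataFrame({'k': [2], 'test': ['200910']}),
}

# reduce applies the two-argument concat pairwise across all frames,
# mirroring reduce(DataFrame.unionAll, dfs) on the Spark side
final_df = reduce(lambda a, b: pd.concat([a, b], ignore_index=True),
                  dict_of_YearMonth.values())
```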

Writing to a csv using pandas with filters

I'm using the pandas library to load in a csv file using Python.
import pandas as pd
df = pd.read_csv("movies.csv")
I'm then checking the columns for specific values or statements, such as:
viewNum = df["views"] >= 1000
starringActorNum = df["starring"] > 3
df["title"] = df["title"].astype("str")
titleLen = df["title"].str.len() <= 10
I want to create a new csv file using the criteria above, but am unsure how to do that as well as how to combine all those attributes into one csv.
Anyone have any ideas?
Combine the boolean masks using & (bitwise-and):
mask = viewNum & starringActorNum & titleLen
Select the rows of df where mask is True:
df_filtered = df.loc[mask]
Write the DataFrame to a csv:
df_filtered.to_csv('movies-filtered.csv')
import pandas as pd
df = pd.read_csv("movies.csv")
viewNum = df["views"] >= 1000
starringActorNum = df["starring"] > 3
df["title"] = df["title"].astype("str")
titleLen = df["title"].str.len() <= 10
mask = viewNum & starringActorNum & titleLen
df_filtered = df.loc[mask]
df_filtered.to_csv('movies-filtered.csv')
You can use the pandas.DataFrame.query() interface. It allows text-string queries and is very fast for large data sets.
Something like this should work:
import pandas as pd
df = pd.read_csv("movies.csv")
# the len() method is not available to query, so pre-calculate
title_len = df["title"].str.len()
# build the data frame and send to csv file; title_len is a local variable, referenced with @
df.query('views >= 1000 and starring > 3 and @title_len <= 10').to_csv(...)
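A quick runnable check of query() with a pre-calculated local variable, referenced with @ inside the query string (the movie data here is made up):

```python
import pandas as pd

# toy stand-in for movies.csv
df = pd.DataFrame({
    'title': ['Short', 'An extremely long movie title'],
    'views': [5000, 2000],
    'starring': [4, 5],
})

# str.len() is not available inside query, so pre-calculate it
title_len = df['title'].str.len()

# local variables are referenced inside the query string with @
filtered = df.query('views >= 1000 and starring > 3 and @title_len <= 10')
```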
