How can I count, with a loop, how many '2-up' and '2-dn' values are in a column at the same index date in a pandas DataFrame?
import pandas as pd

index = ['2020-01-01','2020-01-01','2020-01-01','2020-01-08','2020-01-08','2020-01-08']
df1 = pd.DataFrame(index=index)
bars = ['1-inside','2-up','2-dn','2-up','2-up','1-inside']
df1['Strat'] = bars
df1
Result should be:
2020-01-01 2-up = 1, 2-dn = 1
2020-01-08 2-up = 2, 2-dn = 0
Afterwards I would like to plot the results with matplotlib.
Use SeriesGroupBy.value_counts to get the counts, reshape with Series.unstack, and then plot with DataFrame.plot.bar:
need = ['2-up','2-dn']
df1 = df1['Strat'].groupby(level=0).value_counts().unstack(fill_value=0)[need]
print (df1)
Strat 2-up 2-dn
2020-01-01 1 1
2020-01-08 2 0
Alternatively, filter the rows first with Series.isin and boolean indexing, then count:
need = ['2-up','2-dn']
df1 = (df1.loc[df1['Strat'].isin(need), 'Strat']
          .groupby(level=0)
          .value_counts()
          .unstack(fill_value=0))
df1.plot.bar()
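For the matplotlib step, a minimal sketch assuming the unstacked counts DataFrame produced by either approach above (the reindex and labels are just illustrative additions):
import matplotlib.pyplot as plt

# make sure both categories appear even if one is missing for every date (assumption)
counts = df1.reindex(columns=['2-up', '2-dn'], fill_value=0)

ax = counts.plot.bar(rot=0)
ax.set_xlabel('date')
ax.set_ylabel('count')
plt.show()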
I have a df which compares new and old data. Is there a way to plot every pair of columns, with the x axis as the date? Or to plot all the columns that share the same root name against the date, so that there is one line graph per fruit?
df
date apple_old apple_new banana_old banana_new
0 2015-01-01 5 6 4 2
...
I tried:
for col in df.columns:
    if col.endswith("_old") and col.endswith("_new"):
        x = x.plot(kind="line", x=date, y=(f"{col}_old", f"{col}_new"))
Use:
df1 = df.set_index('date')
df1.columns = df1.columns.str.split('_', expand=True)

for lev in df1.columns.levels[0]:
    df1[lev].plot()
Try this set comprehension:
l = list({i.split('_')[0] for i in df.columns[1:]})

for col in l:
    df.plot(kind="line", x="date", y=[f"{col}_old", f"{col}_new"])
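A minimal end-to-end sketch of the MultiIndex-split approach above, using a small hypothetical frame shaped like the one in the question (the sample values are only for illustration):
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical sample data shaped like the question's frame
df = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01']),
    'apple_old': [5, 6, 7], 'apple_new': [6, 5, 8],
    'banana_old': [4, 3, 5], 'banana_new': [2, 4, 6],
})

df1 = df.set_index('date')
df1.columns = df1.columns.str.split('_', expand=True)  # ('apple', 'old'), ('apple', 'new'), ...

for fruit in df1.columns.levels[0]:
    df1[fruit].plot(title=fruit)  # one figure per fruit, with 'old' and 'new' lines
plt.show()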
When I merge two DataFrames, it keeps the columns from both the left and the right DataFrame, with _x and _y appended.
But I want it to produce one column and 'merge' the values of the two columns such that:
when the values are the same, it just keeps that one value;
when the values are different, it keeps the value from the row whose 'date' column is the latest.
I also tried doing it with concat; in that case it does 'merge' the two columns, but it simply appends the rows of the two frames.
In the code below for example, I would like to get as output the dataframe df_desired. How can I get that?
import pandas as pd
import numpy as np
np.random.seed(30)
company1 = ('comA','comB','comC','comD')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[100,200,300,400]
df1['date'] = [20191231,20191231,20191001,20190931]
print("\ndf1:")
print(df1)
company2 = ('comC','comD','comE','comF')
df2 = pd.DataFrame(columns=None)
df2['company'] = company2
df2['clv']=[300,450,500,600]
df2['date'] = [20191231,20191231,20191231,20191231]
print("\ndf2:")
print(df2)
df_desired = pd.DataFrame(columns=None)
df_desired['company'] = ('comA','comB','comC','comD','comE','comF')
df_desired['clv']=[100,200,300,450,500,600]
df_desired['date'] = [20191231,20191231,20191231,20191231,20191231,20191231]
print("\ndf_desired:")
print(df_desired)
df_merge = pd.merge(df1, df2, left_on='company',
                    right_on='company', how='outer')
print("\ndf_merge:")
print(df_merge)
# alternatively
df_concat = pd.concat([df1, df2], ignore_index=True, sort=False)
print("\ndf_concat:")
print(df_concat)
One approach is to concat the two DataFrames, sort the concatenated DataFrame by date in ascending order, and then drop the duplicate entries (keeping the latest entry) based on company:
df = pd.concat([df1, df2])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df = df.sort_values('date', na_position='first').drop_duplicates('company', keep='last', ignore_index=True)
Result:
company clv date
0 comA 100 2019-12-31
1 comB 200 2019-12-31
2 comC 300 2019-12-31
3 comD 450 2019-12-31
4 comE 500 2019-12-31
5 comF 600 2019-12-31
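An equivalent sketch using groupby instead of drop_duplicates, under the same assumption that the most recent date should win (comD's invalid 20190931 becomes NaT after coercion and loses to df2's valid date):
df = pd.concat([df1, df2])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')

# sort so the most recent (non-null) date comes last within each company, then keep that row
latest = (df.sort_values('date', na_position='first')
            .groupby('company')
            .tail(1)
            .sort_values('company')
            .reset_index(drop=True))
print(latest)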
I have two dataframes:
df1 with columns 'state', 'date', 'number'
df2 with columns 'state', 'specificDate' (one specificDate for one state, each state is mentioned just once)
In the end, I want to have a dataset with columns 'state', 'specificDate', 'number'. Also, I would like to add 14 days to each specific date and get numbers for those dates too.
I tried this
df = df1.merge(df2, left_on='state', right_on='state')
df['newcolumn'] = np.where((df.state == df.state)& (df.date == df.specificDate), df.numbers)
df['newcolumn'] = np.where((df.state == df.state)& (df.date == df.specificDate+datetime.timedelta(days=14)), df.numbers)
but I got this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
When I add .all(), it still gives me the same error.
I feel that my logic is not correct. How else can I insert those values into my dataset?
I think you want to use df2 as the left side of the join. You can use pd.DateOffset to add 14 days.
# create dataset with specific date and specific date + 14
df2_14 = (df2.set_index('state')['specificDate'] + pd.DateOffset(14)).reset_index()
df = pd.concat([df2, df2_14])
# now join the values from df1
df = df.join(df1.set_index(['state', 'date']),
             how='left',
             on=['state', 'specificDate'])
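A quick runnable sketch of that approach, using small hypothetical frames shaped like the ones in the question (the sample values are only for illustration):
import pandas as pd

df1 = pd.DataFrame({'state': ['Alabama'] * 5,
                    'date': pd.to_datetime(['2020-03-12', '2020-03-13', '2020-03-14',
                                            '2020-03-27', '2020-03-28']),
                    'number': [0, 5, 7, 9, 3]})
df2 = pd.DataFrame({'state': ['Alabama', 'Alaska'],
                    'specificDate': pd.to_datetime(['2020-03-13', '2020-03-11'])})

df2_14 = (df2.set_index('state')['specificDate'] + pd.DateOffset(14)).reset_index()
df = pd.concat([df2, df2_14])
df = df.join(df1.set_index(['state', 'date']),
             how='left',
             on=['state', 'specificDate'])
print(df)
# Alabama picks up numbers 5 (2020-03-13) and 9 (2020-03-27); Alaska has no matching rows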
You can declare an empty DataFrame and insert the filtered data into it.
To filter the data, iterate through all rows of df2 and build a mask that selects the rows of df1 between specificDate and specificDate + 14 days with the same state name.
I have created two DataFrames, df1 and df2, with several values from your DataFrames and tested the procedure below.
import pandas as pd
import datetime
data1 = {
    "state": ["Alabama", "Alabama", "Alabama"],
    "date": ["3/12/20", "3/13/20", "3/14/20"],
    "number": [0, 5, 7]
}
data2 = {
    "state": ["Alabama", "Alaska"],
    "specificDate": ["03.13.2020", "03.11.2020"]
}
df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df2 = pd.DataFrame(data2)
df2['specificDate'] = pd.to_datetime(df2['specificDate'])
final_df = pd.DataFrame()
for index, row in df2.iterrows():
    begin_date = row["specificDate"]
    end_date = begin_date + datetime.timedelta(days=14)
    mask = (df1['date'] >= begin_date) & (df1['date'] <= end_date) & (df1['state'] == row['state'])
    filtered_data = df1.loc[mask]
    if not filtered_data.empty:
        final_df = pd.concat([final_df, filtered_data], ignore_index=True)
print(final_df)
Output:
state date number
0 Alabama 2020-03-13 5
1 Alabama 2020-03-14 7
Updated Answer:
To show the data from df1 only for the specific date and the specific date + 14 days, we update the mask of the above code snippet.
import pandas as pd
import datetime
data1 = {
    "state": ["Alabama", "Alabama", "Alabama", "Alabama", "Alabama"],
    "date": ["3/12/20", "3/13/20", "3/14/20", "3/27/20", "3/28/20"],
    "number": [0, 5, 7, 9, 3]
}
data2 = {
    "state": ["Alabama", "Alaska"],
    "specificDate": ["03.13.2020", "03.11.2020"]
}
df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df2 = pd.DataFrame(data2)
df2['specificDate'] = pd.to_datetime(df2['specificDate'])
final_df = pd.DataFrame()
for index, row in df2.iterrows():
    first_date = row["specificDate"]
    last_date = first_date + datetime.timedelta(days=14)
    mask = ((df1['date'] == first_date) | (df1['date'] == last_date)) & (df1['state'] == row['state'])
    filtered_data = df1.loc[mask]
    if not filtered_data.empty:
        final_df = pd.concat([final_df, filtered_data], ignore_index=True)
print(final_df)
Output:
state date number
0 Alabama 2020-03-13 5
1 Alabama 2020-03-27 9
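For larger frames, a loop-free sketch of the same "specific date plus specific date + 14 days" lookup is also possible; this assumes the df1/df2 shapes above, and the intermediate name lookups is only illustrative:
# build one row per (state, wanted date): the specific date and the specific date + 14 days
lookups = pd.concat([
    df2.rename(columns={'specificDate': 'date'}),
    df2.assign(date=df2['specificDate'] + pd.Timedelta(days=14)).drop(columns='specificDate'),
])

# an inner merge keeps only the rows of df1 that fall on one of those dates for that state
result = lookups.merge(df1, on=['state', 'date'], how='inner')
print(result)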
Just a slight tweak on the first line of Eric's answer to make it a little simpler, as I was confused why he used set_index and reset_index.
df2_14 = df2.assign(specificDate=df2['specificDate'] + pd.DateOffset(14))
I have a DataFrame and a Series, and I would like to compute their rolling correlation as a new DataFrame.
So I have 3 columns in df1, and I would like to return a new DataFrame that is the rolling correlation of each of these columns with a Series object.
import pandas as pd
df1 = pd.read_csv('https://bpaste.net/raw/d0456d3a020b')
df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.set_index(df1['Date'])
del df1['Date']
df2 = pd.read_csv('https://bpaste.net/raw/d5cb455cb091')
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index(df2['Date'])
del df2['Date']
pd.rolling_corr(df1, df2)
result https://bpaste.net/show/58b59c656ce4
gives NaNs and 1s only
pd.rolling_corr(df1['IWM_Close'], spy, window=22)
returns the ideal Series, but I did not want to loop through the columns of the DataFrame. Is there a better way to do it?
Thanks.
I believe your second input has to be a Series to be correlated with all columns in the first DataFrame.
This works:
from datetime import date
import numpy as np
import pandas as pd

index = pd.DatetimeIndex(start=date(2015, 1, 1), freq='W', periods=100)
df1 = pd.DataFrame(np.random.random((100, 3)), index=index)
df2 = pd.DataFrame(np.random.random((100, 1)), index=index)
print(pd.rolling_corr(df1, df2.squeeze(), window=20).tail())
or, for the same result:
df2 = pd.Series(np.random.random(100), index=index)
print(pd.rolling_corr(df1, df2, window=20).tail())
0 1 2
2016-10-30 -0.170971 -0.039929 -0.091098
2016-11-06 -0.199441 0.000093 -0.096331
2016-11-13 -0.213728 -0.020709 -0.129935
2016-11-20 -0.075859 0.014667 -0.153830
2016-11-27 -0.114041 0.019886 -0.155472
but this doesn't (note the missing .squeeze()); it only correlates the matching columns:
print(pd.rolling_corr(df1, df2, window=20).tail())
0 1 2
2016-10-30 0.019865 NaN NaN
2016-11-06 0.087075 NaN NaN
2016-11-13 0.011679 NaN NaN
2016-11-20 -0.004155 NaN NaN
2016-11-27 0.111408 NaN NaN
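Note that pd.rolling_corr was removed from newer pandas releases; here is a minimal sketch of the equivalent with the current rolling API, under the same assumption that df2 must be squeezed to a Series:
import numpy as np
import pandas as pd

index = pd.date_range(start='2015-01-01', freq='W', periods=100)
df1 = pd.DataFrame(np.random.random((100, 3)), index=index)
df2 = pd.DataFrame(np.random.random((100, 1)), index=index)

# correlate every column of df1 with the single column of df2
result = df1.rolling(window=20).corr(df2.squeeze())
print(result.tail())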
I'm having trouble using pd.merge after groupby. Here's my hypothetical:
import pandas as pd
from pandas import DataFrame
import numpy as np
df1 = DataFrame({'key': [1, 1, 2, 2, 3, 3],
                 'var11': np.random.randn(6),
                 'var12': np.random.randn(6)})
df2 = DataFrame({'key': [1, 2, 3],
                 'var21': np.random.randn(3),
                 'var22': np.random.randn(3)})
#group var11 in df1 by key
grouped = df1['var11'].groupby(df1['key'])
# calculate the mean of var11 by key
grouped = grouped.mean()
print(grouped)
key
1 1.399430
2 0.568216
3 -0.612843
dtype: float64
print(grouped.index)
Int64Index([1, 2, 3], dtype='int64')
print(df2)
key var21 var22
0 1 -0.381078 0.224325
1 2 0.836719 -0.565498
2 3 0.323412 -1.616901
df2 = pd.merge(df2, grouped, left_on = 'key', right_index = True)
At this point, I get IndexError: list index out of range.
When using groupby, the grouping variable ('key' in this example) becomes the index for the resultant series, which is why I specify 'right_index = True'. I've tried other syntax without success. Any advice?
I think you should just do this:
In [140]:
df2 = pd.merge(df2,
               grouped.to_frame('mean'),
               left_on='key',
               right_index=True)
print(df2)
key var21 var22 mean
0 1 0.324476 0.701254 0.400313
1 2 -1.270500 0.055383 -0.293691
2 3 0.804864 0.566747 0.628787
[3 rows x 4 columns]
The reason it didn't work is that grouped is a Series, not a DataFrame.
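Equivalently, a short sketch that avoids the index round-trip by keeping 'key' as an ordinary column during the groupby (the column name 'mean' is just illustrative):
# mean of var11 per key, with 'key' kept as an ordinary column
means = df1.groupby('key', as_index=False)['var11'].mean().rename(columns={'var11': 'mean'})

# plain column-on-column merge, no index alignment needed
df2 = pd.merge(df2, means, on='key')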