I have a df that I scraped from coinmarketcap. I am trying to calculate volatility metrics for the close_price column, but when I use a groupby I get this error message:
final_coin_data['vol'] = final_coin_data.groupby('coin_name')['close_price'].rolling(window=30).std()
TypeError: incompatible index of inserted column with frame index
df structure (the 'Unnamed: 0' column appeared after I loaded my CSV):
Unnamed: 0 close_price coin_name date high_price low_price market_cap open_price volume
0 1 9578.63 Bitcoin Mar 11, 2018 9711.89 8607.12 149,716,000,000 8852.78 6,296,370,000
1 2 8866.00 Bitcoin Mar 10, 2018 9531.32 8828.47 158,119,000,000 9350.59 5,386,320,000
2 3 9337.55 Bitcoin Mar 09, 2018 9466.35 8513.03 159,185,000,000 9414.69 8,704,190,000
3 1 9578.63 Monero Mar 11, 2018 9711.89 8607.12 149,716,000,000 8852.78 6,296,370,000
4 2 8866.00 Monero Mar 10, 2018 9531.32 8828.47 158,119,000,000 9350.59 5,386,320,000
5 3 9337.55 Monero Mar 09, 2018 9466.35 8513.03 159,185,000,000 9414.69 8,704,190,000
(ignore the incorrect prices; this is just the basic shape of the df)
When using the following code:
final_coin_data1['vol'] = final_coin_data.groupby('coin_name')['close_price'].rolling(window=30).std().reset_index(0,drop=True)
I got a MemoryError. I thought I was using groupby correctly. If I take out the final_coin_data1['vol'] = assignment, I get a Series that appears correct, but it won't let me insert it back into the df.
When I first started this project, I had just one coin, and the code below calculated volatility with no problem:
final_coin_data1['vol'] = final_coin_data['close_price'].rolling(window=30).std()
When I ran the grouped rolling calculation on its own,
final_coin_data.groupby('coin_name')['close_price'].rolling(window=30).std()
it produced a Series with an index column and a result column. When I tried to merge that back into the original df as a new column final_coin_data1['vol'], I got TypeError: incompatible index of inserted column with frame index. To correct this, I used reset_index(0, drop=True), which drops the coin_name level that the groupby adds and leaves the original row index, allowing the result to be assigned to final_coin_data1['vol'].
The final functioning code looks like this:
final_coin_data1['vol'] = final_coin_data.groupby('coin_name')['close_price'].rolling(window=30).std().reset_index(0,drop=True)
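For anyone hitting the same index mismatch, here is a minimal self-contained sketch of the pattern (the toy data is made up for illustration):
import numpy as np
import pandas as pd

# Toy frame: two coins with 60 rows each and a plain RangeIndex
df = pd.DataFrame({
    'coin_name': ['Bitcoin'] * 60 + ['Monero'] * 60,
    'close_price': np.random.rand(120) * 10000,
})

# groupby().rolling() returns a Series indexed by (coin_name, original row),
# so dropping level 0 realigns it with the frame before assignment.
vol = df.groupby('coin_name')['close_price'].rolling(window=30).std()
df['vol'] = vol.reset_index(level=0, drop=True)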
I have a data frame containing the daily CO2 data since 2015, and I would like to produce the monthly mean data for each year, then put this into a new data frame. A sample of the data frame I'm using is shown below.
month day cycle trend
year
2011 1 1 391.25 389.76
2011 1 2 391.29 389.77
2011 1 3 391.32 389.77
2011 1 4 391.36 389.78
2011 1 5 391.39 389.79
... ... ... ... ...
2021 3 13 416.15 414.37
2021 3 14 416.17 414.38
2021 3 15 416.18 414.39
2021 3 16 416.19 414.39
2021 3 17 416.21 414.40
I plan on using something like the code below to create the new monthly-mean data frame, but the main problem I'm having is selecting the specific subset for each month of each year so that the mean can be taken over it. If I could select all of year "2015" for month "1" and then average that, etc., that might work?
Any suggestions would be hugely appreciated, and if I need to make any edits please let me know. Thanks so much!
dfs = list()
for l in L:
    dfs.append(refined_data[index = 2015, "month" = 1, day <= 31].iloc[l].mean(axis=0))
mean_matrix = pd.concat(dfs, axis=1).T
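A minimal sketch of one way to get those monthly means, assuming the frame matches the sample above (index named year; columns month, day, cycle, trend; refined_data is the name from the question):
import pandas as pd

# One row per (year, month), averaging the numeric data columns;
# 'year' resolves to the index level, 'month' to the column.
monthly_means = (refined_data
                 .groupby(['year', 'month'])[['cycle', 'trend']]
                 .mean()
                 .reset_index())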
I have a question. I am dealing with a DataFrame with a DatetimeIndex in pandas. I want to perform a count on a particular column, grouped by month.
For example:
df.groupby(df.index.month)["count_interest"].count()
Assuming I am analyzing data starting from December 2019, I get a result like this:
date
1 246
2 360
3 27
12 170
In reality, December 2019 is supposed to come first. What can I do? When I plot the frame grouped by month, December 2019 shows up last, which is chronologically incorrect.
You can try reindex:
df.groupby(df.index.month)["count_interest"].count().reindex([12,1,2,3])
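If you would rather not hard-code the order, a sketch that groups on year-month periods instead of bare month numbers (assuming a DatetimeIndex, as in the question):
# Periods like 2019-12, 2020-01, ... sort chronologically on their own.
df.groupby(df.index.to_period('M'))["count_interest"].count()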
I need to create a data frame for 100 customer_ids along with their expenses for each day from 1st June 2019 to 31st August 2019. I already have the customer ids in a list and the dates in a list as well. How do I make a data frame in the format shown below?
CustomerID TrxnDate
1 1-Jun-19
1 2-Jun-19
1 3-Jun-19
1 Upto....
1 31-Aug-19
2 1-Jun-19
2 2-Jun-19
2 3-Jun-19
2 Upto....
2 31-Aug-19
and so on for the remaining customer ids
I already have a customer_id data frame created with pandas; now I need to map each customer_id to every date, i.e. if the customer id is 1, then 1 should have all dates from 1st June 2019 to 31st Aug 2019, and then customer id 2 should have the same dates. Please see the required data frame above.
# import module
import pandas as pd
# list of dates
lst = ['1-Jun-19', '2-Jun-19', '3-Jun-19']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
Repeat the operation for the customer IDs, store the result in df2 or something similar, and then:
frames = [df, df2]
result = pd.concat(frames)
There are simpler methods, but this will give you an idea of how it is carried out.
Since you want one block of rows per customer, first create the dataframe for customer ID 1, then repeat the same for customer ID 2, and then concat those dataframes.
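One of the simpler methods alluded to above is a cross join via pd.MultiIndex.from_product; a minimal sketch (the variable names are illustrative):
import pandas as pd

customer_ids = list(range(1, 101))                 # the 100 customer ids
dates = pd.date_range('2019-06-01', '2019-08-31')  # 1-Jun-19 .. 31-Aug-19, daily

# Every (CustomerID, TrxnDate) combination becomes one row.
result = pd.MultiIndex.from_product(
    [customer_ids, dates], names=['CustomerID', 'TrxnDate']
).to_frame(index=False)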
I have a dataframe with records from 2011 to 2018. One of the columns is drop_off_date, the date when the customer left the rewards program. For each month between 2011 and 2018, I want to count how many people dropped off during that month. So for the 84-month period, I want the monthly count of people who dropped off, using the drop_off_date column.
I changed the column to datetime and I know I can use the .agg and .count methods, but I am not sure how to count per month. I honestly do not know what the next step would be.
Example of the data:
Record ID | store ID | drop_off_date
a1274c212 | 12876    | 2011-01-27
a1534c543 | 12877    | 2011-02-23
a1232c952 | 12877    | 2018-12-02
The result should look like this:
Month    | # of dropoffs
Jan 2011 | 15
........
Dec 2018 | 6
What I suggest is to work directly with the strings in the column drop_off_date and strip them to keep only the year and month:
df['drop_off_ym'] = df.drop_off_date.apply(lambda x: x[:-3])
Then you apply a groupby on the newly created column, followed by a count():
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
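Note that this assumes drop_off_date still holds strings like '2011-01-27'; if the column has already been cast to datetime (as the question mentions), an equivalent would be:
df['drop_off_ym'] = df['drop_off_date'].dt.strftime('%Y-%m')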
Using your data, I'm assuming your date has been cast to datetime with errors='coerce' to handle outliers. You should then drop any NAs from it so you're only dealing with customers who dropped off. You can do this in a multitude of ways; I would do a simple df.dropna(subset=['drop_off_date']).
print(df)
Record ID store ID drop_off_date
0 a1274c212 12876 2011-01-27
1 a1534c543 12877 2011-02-23
2 a1232c952 12877 2018-12-02
Let's create a month column to use as an aggregation key:
df['Month'] = df['drop_off_date'].dt.strftime('%b')
Then we can do a simple groupby with a count on Record ID (assuming you only want to count unique IDs):
df1 = df.groupby(df['Month'])['Record ID'].count().reset_index()
print(df1)
Month Record ID
0 Dec 1
1 Feb 1
2 Jan 1
EDIT: To account for years, first let's create a year helper column:
df['Year'] = df['drop_off_date'].dt.year
df1 = df.groupby(['Month','Year' ])['Record ID'].count().reset_index()
print(df1)
Month Year Record ID
0 Dec 2018 1
1 Feb 2011 1
2 Jan 2011 1
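A more compact variant, assuming drop_off_date is already datetime: grouping on a monthly period collapses the two helper columns into one key, and the result sorts chronologically (Jan 2011 through Dec 2018) on its own:
# PeriodIndex entries like 2011-01 order correctly by time.
df1 = (df.groupby(df['drop_off_date'].dt.to_period('M'))['Record ID']
         .count()
         .reset_index(name='#of dropoffs'))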
I am trying to average each cell of a bunch of .csv files to export as a single averaged .csv file using Pandas.
I have no problem creating the dataframes themselves, but when I try to turn them into a Panel (i.e. panel = pd.Panel(dataFrame)), I get the error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects
An example of what each csv file looks like:
Year, Month, Day, Latitude, Longitude, Value1, Value 2
2010, 06, 01, 23, 97, 1, 3.5
2010, 06, 01, 24, 97, 5, 8.2
2010, 06, 01, 25, 97, 6, 4.6
2010, 06, 01, 26, 97, 4, 2.0
Each .csv file comes from gridded data, so they all have the same number of rows and columns, as well as some no-data values (given as -999.9), which my code snippet below addresses.
The code that I have so far to do this is:
june = []
for csv1 in glob.glob(path + '\\' + '*.csv'):
    if csv1[-10:-8] == '06':
        june.append(csv1)
dfs = {i: pd.DataFrame.from_csv(i) for i in june}
panel = pd.Panel(dfs)
panels = panel.replace(-999.9, np.NaN)
dfs_mean = panels.mean(axis=0)
I have seen questions where the user gets the same error, but the solutions to those questions don't seem to work for my issue. Any help fixing this, or ideas for a better approach, would be greatly appreciated.
pd.Panel has been deprecated. Use pd.concat with a dictionary comprehension instead and take the mean over level 1:
from glob import glob
import pandas as pd

df1 = pd.concat({f: pd.read_csv(f) for f in glob('meansample[0-9].csv')})
df1.mean(level=1)
Year Month Day Latitude Longitude Value1 Value 2
0 2010 6 1 23 97 1 3.5
1 2010 6 1 24 97 5 8.2
2 2010 6 1 25 97 6 4.6
3 2010 6 1 26 97 4 2.0
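One caveat: DataFrame.mean(level=...) has since been removed from pandas as well; on recent versions the equivalent grouping is:
df1.groupby(level=1).mean()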
I have a suggestion to change the approach a bit. Instead of converting the DFs into a Panel, just concat them into one big DF, giving each one a unique ID. Afterwards you can group by that ID and use mean() to get the result.
It would look similar to this:
import glob

import numpy as np
import pandas as pd

df = pd.DataFrame()
for csv1 in glob.glob(path + '\\' + '*.csv'):
    if csv1[-10:-8] == '06':
        temp_df = pd.read_csv(csv1)
        temp_df['df_id'] = csv1  # tag each file's rows with a unique ID
        df = pd.concat([df, temp_df])
df = df.replace(-999.9, np.nan)  # replace() returns a copy, so assign it back
result = df.groupby("df_id").mean()
I hope this helps; if you still have any issues with it, let me know.
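A small caveat on the sketch above: grouping by df_id yields one mean row per file, whereas the question asks for the cell-wise average across files. Since each file keeps its original RangeIndex through the concat, equal row positions share an index label, so one can group on that instead (a sketch under that assumption, applied to the concatenated df from before the final groupby):
# Average row 0 across all files, row 1 across all files, and so on.
dfs_mean = df.drop(columns='df_id').groupby(level=0).mean()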