I am trying to average each cell of a bunch of .csv files to export as a single averaged .csv file using Pandas.
I have no problems, creating the dataframe itself, but when I try to turn it into a Panel (i.e. panel=pd.Panel(dataFrame)), I get the error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects pandas pd.panel
An example of what each csv file looks like:
Year, Month, Day, Latitude, Longitude, Value1, Value 2
2010, 06, 01, 23, 97, 1, 3.5
2010, 06, 01, 24, 97, 5, 8.2
2010, 06, 01, 25, 97, 6, 4.6
2010, 06, 01, 26, 97, 4, 2.0
Each .csv file is from gridded data so they have the same number of rows and columns, as well as some no data values (given a value of -999.9), which my code snippet below addresses.
The code that I have so far to do this is:
june=[]
for csv1 in glob.glob(path+'\\'+'*.csv'):
if csv1[-10:-8] == '06':
june.append(csv1)
dfs={i: pd.DataFrame.from_csv(i) for i in june}
panel=pd.Panel(dfs)
panels=panel.replace(-999.9,np.NaN)
dfs_mean=panels.mean(axis=0)
I have seen questions where the user is getting the same error, but the solutions for those questions doesn't seem to work with my issue. Any help fixing this, or ideas for a better approach would be greatly appreciated.
pd.Panel has been deprecated
Use pd.concat with a dictionary comprehension and take the mean over level 1.
df1 = pd.concat({f: pd.read_csv(f) for f in glob('meansample[0-9].csv')})
df1.mean(level=1)
Year Month Day Latitude Longitude Value1 Value 2
0 2010 6 1 23 97 1 3.5
1 2010 6 1 24 97 5 8.2
2 2010 6 1 25 97 6 4.6
3 2010 6 1 26 97 4 2.0
I have a suggestion to change the approach a bit. Instead of converting each DF into panel, just concat them into one big DF but for each one give a unique ID. After you can just do groupby by the ID and use mean() to get the result.
It would look similar to this:
import Pandas as pd
df = pd.DataFrame()
for csv1 in glob.glob(path+'\\'+'*.csv'):
if csv1[-10:-8] == '06':
temp_df = pd.read_csv(csv1)
temp_df['df_id'] = csv1
df = pd.concat([df, temp_df])
df.replace(-999.9, np.nan)
df = df.groupby("df_id").mean()
I hope it helps somehow, if you still have any issues with that let me know.
Related
I receive some data in 11 different pandas series. I need to combine the whole data into one pandas dataframe to carry out further analysis and reporting.
The format in which the data is received is as under:
Series1:
Sales
Item Series Year
A Sal 2018 100
2019 200
B Sal 2018 300
2019 400
Series2:
Purchases
Item Series Year
A Pur 2018 50
2019 100
B Pur 2018 150
2019 200
Series3:
Expenses
Product Series Year
A Exp 2019 100
B Exp 2019 200
I have a number of series parameter. So, I created a loop where the following code merges two of the total series till the all series are merged. I have tried to consolidate all such series into one dataframe using this code:
df = pd.merge(df,series1,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
But even if we write separate lines for each two pairs for our example here, it will be:
df = pd.merge(series1,series2,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
df = pd.merge(df,series3,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
However the issue with this is:
It only allows to merge two series at a time.
When I merge the third series in this example, as it doesn't have data for 2018, instead of putting NULL there, it remove the 2018 rows for even the series 1 and series 2 data in the dataframe. So, I am only left with merged data from all three series for 2019.
I considered converting all the series to list individually and then converting those lists to a dictionary, which then is converted into a dataframe. That works, but requires a lot of effort and requires code change if number of series changes. So, this doesn't work for me.
Any other way to do this?
Did you try using the to_frame method?
For example, you could use
df = pd.Series["a", "b", "c"]
df.to_frame()
to convert.
Try using this method in your data frame.
Here's it in the docs.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html
Try pd.concat():
import pandas as pd
import pandas as pd
s1 = pd.Series([100, 200, 300, 400], index = pd.MultiIndex.from_arrays([['A','A','B','B'],['1','1','2','2'], [2018, 2019, 2018, 2019]]))
s2 = pd.Series([50, 100, 150, 200], index = pd.MultiIndex.from_arrays([['A','A','B','B'],['3','3','4','4'], [2018, 2019, 2018, 2019]]))
s3 = pd.Series([100, 200], index = pd.MultiIndex.from_arrays([['A','B'],['5','6'], [2019, 2019]]))
df = pd.concat([s.droplevel(1) for s in [s1, s2, s3]], axis = 1)
0 1 2
A 2018 100 50 NaN
2019 200 100 100.0
B 2018 300 150 NaN
2019 400 200 200.0
Based on the code below, I'm trying to assign some columns to my DataFrame which has been grouped by month of the date and works well :
all_together = (df_clean.groupby(df_clean['ContractDate'].dt.strftime('%B'))
.agg({'Amount': [np.sum, np.mean, np.min, np.max]})
.rename(columns={'sum': 'sum_amount', 'mean': 'avg_amount', 'amin': 'min_amount', 'amax': 'max_amount'}))
But for some reason when I try to plot the result(in any kind as plot), it's not able to recognize my "ContractDate" as a column and also any of those renamed names such as: 'sum_amount'.
Do you have any idea that what's the issue and what am I missing as a rule for plotting the data?
I have tried the code below for plotting and it asks me what is "ContractDate" and what is "sum_amount"!
all_together.groupby(df_clean['ContractDate'].dt.strftime('%B'))['sum_amount'].nunique().plot(kind='bar')
#or
all_together.plot(kind='bar',x='ContractDate',y='sum_amount')
I really appreciate your time
Cheers,
z.A
When you apply groupby function on a DataFrame, it makes the groupby column as index(ContractDate in your case). So you need to reset the index first to make it as a column.
df = pd.DataFrame({'month':['jan','feb','jan','feb'],'v2':[23,56,12,59]})
t = df.groupby('month').agg('sum')
Output:
v2
month
feb 115
jan 35
So as you see, you're getting months as index. Then when you reset the index:
t.reset_index()
Output:
month v2
0 feb 115
1 jan 35
Next when you apply multiple agg functions on a single column in the groupby, it will create a multiindexed dataframe. So you need to make it as single level index:
t = df.groupby('month').agg({'v2': [np.sum, np.mean, np.min, np.max]}).rename(columns={'sum': 'sum_amount', 'mean': 'avg_amount', 'amin': 'min_amount', 'amax': 'max_amount'})
v2
sum_amount avg_amount min_amount max_amount
month
feb 115 57.5 56 59
jan 35 17.5 12 23
It created a multiindex.if you check t.columns, you get
MultiIndex(levels=[['v2'], ['avg_amount', 'max_amount', 'min_amount', 'sum_amount']],
labels=[[0, 0, 0, 0], [3, 0, 2, 1]])
Now use this:
t.columns = t.columns.get_level_values(1)
t.reset_index(inplace=True)
You will get a clean dataframe:
month sum_amount avg_amount min_amount max_amount
0 feb 115 57.5 56 59
1 jan 35 17.5 12 23
Hope this helps for your plotting.
There are several posts on how to drop rows if one column in a dataframe holds a certain undesired string, but I am struggling with how to do that if I have to check all columns in a dataset for that string, AND if I do not know beforehand exactly which column contains the string.
Suppose:
data = pd.DataFrame({'col1' : ['December 31,', 'December 31, 2019', 'countryB', 'countryC'],
'col2' : ['December 31,', 21, 19, 18],
'col3' : [np.NaN, 22, 23, 14]})
Which gives:
col1 col2 col3
0 December 31, December 31, NaN
1 December 31, 2019 21 22.0
2 countryB 19 23.0
3 countryC 18 14.0
I want to delete all rows that contain December 31,, but not if December 31, is followed by a year in YYYY format. Is use a regex for that: r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})', which properly identifies December 31, only.
The problem is that I have many of such tables, and I do not know beforehand in which column the December 31, (or its equivalent for other months) appears.
What I currently do is:
delete = pd.DataFrame(columns = data.columns)
for name, content in data.iteritems():
take = data[data[name].astype(str).str.contains(r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})',
regex = True,
flags = re.IGNORECASE & re.DOTALL, na = False)]
delete = delete.append(take)
delete = delete.drop_duplicates()
index = mean(delete.index)
clean = data.drop([index])
Which returns, as desired:
col1 col2 col3
1 December 31, 2019 21 22.0
2 countryB 19 23.0
3 countryC 18 14.0
That is, I loop over all columns in data, store in delete the rows that I want to delete from data, delete duplicates (because December 31, appears in col1 and col2), get the index of the unique undesired row (0 here) and then delete that row in data based on the index. It does work, but that seems like a cumbersome way of achieving this.
I am wondering: Is there a better way of deleting all rows in which December 31, appears in any column?
data[~data.apply(lambda x: any([True if re.match('December 31,$',str(y)) else False for y in x]), axis=1)]
You can use .apply method to filter rows like this.
Doesn't using r"December 31,$"' regex works for your case? $ represent ending of string. If not just replace regex with your working regex.
Use pd.DataFrame.any(...)
mask = data.astype(str).apply(lambda x: x.str.contains(r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})',
regex = True, flags = re.IGNORECASE & re.DOTALL, na = False)).any(axis=1)
data.loc[~mask]
I just came across a strange phenomenon with Pandas DataFrames, when setting index using DataFrame.set_index('some_index') the old column that was also an index is deleted! Here is an example:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
>>> df_mn
sale year
month
1 55 2012
4 40 2014
7 84 2013
10 31 2014
Now I change the index to year:
df_mn.set_index('year')
sale
year
2012 55
2014 40
2013 84
2014 31
.. and the month column was removed with the index. This is vary irritating because I just wanted to swap the DataFrame index.
Is there a way to not have the previous column that was an index from being deleted? Maybe through something like: DataFrame.set_index('new_index',delete_previous_index=False)
Thanks for any advice
You can do the following
>>> df_mn.reset_index().set_index('year')
month sale
year
2012 1 55
2014 4 40
2013 7 84
2014 10 31
the solution I found to reatain a previous columns is to set drop=False
dataframe.set_index('some_column',drop=False). This is not the perfect answer but it works!
No, in such cases you have to save your previous column, like shown
below:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
df_mn['month'] = df_mn.index #Save it as another column, and then run set_index with year column as value.
df_mn.set_index('year')
Besides you are using a duplicate dataframe df_mn , so the dataframe df remains unchanged you can use it again.
And also if you aren't setting the
inplace argument for set_index to True
df_mn won't have changed even after you call set_index() on it.
Also, like the other answer you can always use reset_index().
I have a df that I scraped from coinmarketcap. I am trying to calculate volitlity metrics for the close_price column but when I use a groupby I'm getting an error message:
final_coin_data['vol'] = final_coin_data.groupby('coin_name')['close_price'].rolling(window=30).std()
TypeError: incompatible index of inserted column with frame index
df structure (the 'Unnamed:0' came after I loaded my CSV):
Unnamed: 0 close_price coin_name date high_price low_price market_cap open_price volume
0 1 9578.63 Bitcoin Mar 11, 2018 9711.89 8607.12 149,716,000,000 8852.78 6,296,370,000
1 2 8866.00 Bitcoin Mar 10, 2018 9531.32 8828.47 158,119,000,000 9350.59 5,386,320,000
2 3 9337.55 Bitcoin Mar 09, 2018 9466.35 8513.03 159,185,000,000 9414.69 8,704,190,000
3 1 9578.63 Monero Mar 11, 2018 9711.89 8607.12 149,716,000,000 8852.78 6,296,370,000
4 2 8866.00 Monero Mar 10, 2018 9531.32 8828.47 158,119,000,000 9350.59 5,386,320,000
5 3 9337.55 Monero Mar 09, 2018 9466.35 8513.03 159,185,000,000 9414.69 8,704,190,000
(ignore the incorrect prices, this is the basics of the df)
When using the following code:
final_coin_data1['vol'] = final_coin_data.groupby('coin_name')['close_price'].rolling(window=30).std().reset_index(0,drop=True)
I got a MemoryError. I thought I was using groupby correctly. If I take out the final_coin_data1['vol'] = then I get a series which appears correct, but it won't let me insert back into the df.
When I first started this project. I had just 1 coin and used the code below and it calculated volatility no problem.
final_coin_data1['vol'] = final_coin_data['close_price'].rolling(window=30).std()
When I ran this,
final_coin_data['close_price'].rolling(window=30).std()
an index column and result column are generated. When I tried to merge back to the original df as a new column final_coin_data1['vol'] I was getting an error TypeError: incompatible index of inserted column with frame index so to correct this, I reset_index(drop=True) then this eliminated the index which allowed the result to be joined on the final_coin_data1['vol'].
The final functioning code looks like this:
final_coin_data1['vol'] = final_coin_data.groupby('coin_name')['close_price'].rolling(window=30).std().reset_index(0,drop=True)