I am using an extremely large dataset with around 1.6 million individual entries for the timespan I am trying to observe (1948 - 1960). This is how my dataset is loaded into pandas before I attempt to average it:
import pandas as pd
import pyreadr
data = pyreadr.read_r('C:/fileLocation/file.rds')
df = data[None]
df['time'] = pd.to_datetime(df['time'])
df.set_index('time', inplace=True)
df = df['1948':'1960']
print(df.info())
df_groups = df.groupby(['lat', 'lon'])['spei'].mean()
print(df_groups.head())
What I am trying to accomplish is to take each pair of latitude and longitude values, compute the average spei value for that pair, and then create a new pandas DataFrame of those unique pairs with their average spei attached, to be plotted later. Instead, I am only getting 5 rows of seemingly random latitude and longitude values, rather than every unique pair with its average spei computed from all the repeating lon/lat values. I've used this post to try to get some answers, but I have not been able to find a fix yet.
Thank you!
Your groupby is actually producing every unique (lat, lon) pair already; you only see 5 rows because print(df_groups.head()) shows just the first five rows by default. This should solve your issue:
import pandas as pd
# create sample dataframe
data = {
'lat': [40.0, 40.0, 41.0, 41.0, 42.0, 42.0],
'lon': [-105.0, -106.0, -105.0, -106.0, -105.0, -106.0],
'spei': [-1.2, -0.8, -0.5, -1.1, -1.3, -0.9]
}
df = pd.DataFrame(data)
# group by pairs of latitude and longitude and calculate the mean spei value for each pair
df_groups = df.groupby(['lat', 'lon'])['spei'].mean().reset_index()
df_groups.columns = ['lat', 'lon', 'spei_mean']
# print the resulting dataframe
print(df_groups)
which returns:
lat lon spei_mean
0 40.0 -106.0 -0.8
1 40.0 -105.0 -1.2
2 41.0 -106.0 -1.1
3 41.0 -105.0 -0.5
4 42.0 -106.0 -0.9
5 42.0 -105.0 -1.3
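Applied to the dataframe from the question (the full 1948 - 1960 slice), the same pattern looks like this - a minimal sketch; as_index=False returns a flat DataFrame rather than a MultiIndex-indexed Series:
# full table: one row per unique (lat, lon) pair with its mean spei
df_groups = df.groupby(['lat', 'lon'], as_index=False)['spei'].mean()
df_groups = df_groups.rename(columns={'spei': 'spei_mean'})
print(df_groups)  # print the whole frame rather than just .head()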
I have a data frame of this format:
import pandas as pd
df = pd.DataFrame({
1: {'mean': 1.0, 'std': 0.8},
2: {'mean': 0.5, 'std': 0.2},
3: {'mean': 0.2, 'std': 0.1},
4: {'mean': 0.1, 'std': 0.1},
5: {'mean': 0.6, 'std': 0.2}
})
df
1 2 3 4 5
mean 1.0 0.5 0.2 0.1 0.6
std 0.8 0.2 0.1 0.1 0.2
Based on these values of mean and std, I am trying to generate a big data frame of randomly generated numbers normally distributed, which has the same number of columns but more rows:
full_noise = []
for mean, std in enumerate(df):
noise = np.random.normal(mean, std, [5, 1000])
full_noise.append(noise)
So, each column of this new data frame will have values generated on mean and std listed in the data frame above. I am definitely doing something wrong, though.
Sorry, I am quite new to Python! I hope you can help :(
To create what you want, I would suggest iterating over the dataframe df one column at a time (to do so, first transpose the dataframe and then use iterrows).
For each column you can generate a numpy array of the length you desire from a normal distribution, using the mean and std from that column.
At the end you can concatenate the numpy arrays as columns of a dataframe (so along axis=1).
import numpy as np
import pandas as pd

full_noise = []
for _, col in df.T.iterrows():
    # draw 1000 samples from a normal distribution with this column's mean and std
    noise = np.random.normal(loc=col["mean"], scale=col["std"], size=(1000,))
    full_noise.append(pd.Series(noise))
noise_df = pd.concat(full_noise, axis=1)
You can also use .apply to build full_noise in one step:
full_noise = df.apply(
lambda col: np.random.normal(loc=col["mean"], scale=col["std"], size=(1_000,)),
)
print(full_noise)
1 2 3 4 5
0 0.900445 0.555275 0.206491 0.161578 0.491196
1 1.555625 0.261742 0.196981 -0.068225 0.770397
2 0.308983 0.256334 0.119617 0.157978 0.453351
3 0.799080 0.255109 0.164719 -0.088953 0.462583
4 1.263621 0.650327 0.217544 0.046004 0.893409
.. ... ... ... ... ...
995 1.345332 0.827836 0.320708 0.113350 0.789898
996 1.235461 0.464576 0.270596 0.049924 0.708799
997 1.211508 0.751700 0.230916 0.176736 0.661312
998 1.753942 0.941567 0.097372 0.177429 0.810710
999 1.847943 0.240993 -0.006139 0.200517 0.523238
[1000 rows x 5 columns]
I am trying (unsuccessfully) to create separate columns for nested dictionary keys. The dict data looks like this:
[{'averagePrice': 32.95,
  'currentDayProfitLoss': 67.2,
  'currentDayProfitLossPercentage': 0.02,
  'instrument': {'assetType': 'EQUITY', 'cusip': '902104108', 'symbol': 'IIVI'},
  'longQuantity': 120.0,
  'marketValue': 4021.2,
  'settledLongQuantity': 120.0,
  'settledShortQuantity': 0.0,
  'shortQuantity': 0.0}]
The 'instrument' key is what I am trying to flatten into columns (i.e. assetType, cusip, symbol). Here is the code I last tried, and still no individual columns:
data = accounts_data_single
my_dict = data
headers = list(my_dict['securitiesAccount']['positions'])
dict1 = my_dict['securitiesAccount']['positions']
mypositions = pd.DataFrame(dict1)
pd.concat([mypositions.drop(['instrument'], axis=1), mypositions['instrument'].apply(pd.Series)], axis=1)
mypositions.to_csv('Amer_temp.csv')
Any suggestions are greatly appreciated
I am trying to get the nested keys/fieldnames all in columns and then all the stock positions in the rows. The above code works great, except the nested 'instrument' keys all end up in one column.
averagePrice currentDayProfitLoss ... assetType cusip symbol
22.5 500 ... Equity 013245 IIVI
450 250 ... Equity 321354 AAPL
etc
Here's a way to do this. Let's say d is your dict.
Step 1: Convert the dict to dataframe
d1 = pd.DataFrame.from_dict(d, orient='index').T.reset_index(drop=True)
Step 2: Convert the instrument column into dataframe
d2 = d1['instrument'].apply(pd.Series)
Step 3: Join the outputs of step 1 and step 2
df = pd.concat([d1.drop('instrument', axis=1), d2], axis=1)
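As a side note (not part of the original answer): on pandas 1.0+, pd.json_normalize can flatten the nested 'instrument' dict in a single call. A minimal sketch, using the position dict from the question:
import pandas as pd

d = {'averagePrice': 32.95, 'currentDayProfitLoss': 67.2,
     'currentDayProfitLossPercentage': 0.02,
     'instrument': {'assetType': 'EQUITY', 'cusip': '902104108', 'symbol': 'IIVI'},
     'longQuantity': 120.0, 'marketValue': 4021.2,
     'settledLongQuantity': 120.0, 'settledShortQuantity': 0.0, 'shortQuantity': 0.0}

# json_normalize expands nested dicts into dotted column names like 'instrument.assetType'
flat = pd.json_normalize(d)
# optionally strip the 'instrument.' prefix so the columns read assetType, cusip, symbol
flat.columns = [c.replace('instrument.', '') for c in flat.columns]
print(flat)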
Are you trying to do this:
pd.DataFrame(d).assign(**pd.DataFrame([x['instrument'] for x in d])).drop(columns='instrument')
output:
averagePrice currentDayProfitLoss currentDayProfitLossPercentage longQuantity marketValue settledLongQuantity settledShortQuantity shortQuantity assetType cusip symbol
0 32.95 67.2 0.02 120.0 4021.2 120.0 0.0 0.0 EQUITY 902104108 IIVI
1 31.95 63.2 0.01 100.0 3021.2 100.0 0.0 0.0 EQUITY 802104108 AAPL
I have a pandas DataFrame with 3 columns. The first column contains string values in ascending order, at a certain frequency (e.g. '20173070000', '20173070020', '20173070040', etc.). The second and third columns contain corresponding integer values. I would like to re-sample the first column to every one - '20173070000', '20173070001', '20173070002', simultaneously filling the second and third columns with NaN values, and then I would like to interpolate those NaN values.
I've looked into re-sampling data, but this appears to only work for datetime values. I have also looked into pd.interpolate, but this appears to work for interpolating between missing values. As stated above, my dataset does not contain missing data; I am simply looking to increase the frequency of my entries, to fill in between existing values.
To give some reference, my current DataFrame looks like this:
0 1 2
0 20173070000 14.0 13.9
1 20173070020 14.1 14.1
2 20173070040 13.8 13.6
3 20173070060 13.7 13.7
4 20173070080 13.8 13.5
5 20173070100 13.9 14.0
I would like to generate a DataFrame that looks like:
0 1 2
0 20173070000 14.0 13.9
1 20173070001 NaN NaN
2 20173070002 NaN NaN
3 20173070003 NaN NaN
4 20173070004 NaN NaN
5 20173070005 NaN NaN
...
20 20173070020 14.1 14.1
21 20173070021 NaN NaN
...
I have no problem sorting out the interpolation afterwards, but I have not worked out how to upsample yet.
You can just use the reindex function. By default, it places NaN in locations that have no value in the new index.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [20173070000, 20173070020, 20173070040, 20173070060, 20173070080, 20173070100],
                   'B': [14, 14.1, 13.8, 13.7, 13.8, 13.9],
                   'C': [13.9, 14.1, 13.6, 13.7, 13.5, 14.0]})

df.set_index('A').reindex(np.arange(np.min(df.A), np.max(df.A) + 1)).reset_index()
I believe interpolate() is the way to go for you. After upsampling as you described, and given that the column containing the values you want to interpolate is called 'val1', you can do:
df.loc[:, 'val1'] = df.loc[:, 'val1'].interpolate()
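Putting the two answers together, a minimal end-to-end sketch (assuming the first column has already been converted to integers, and reusing the column names A, B, C from the reindex answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [20173070000, 20173070020, 20173070040, 20173070060, 20173070080, 20173070100],
                   'B': [14.0, 14.1, 13.8, 13.7, 13.8, 13.9],
                   'C': [13.9, 14.1, 13.6, 13.7, 13.5, 14.0]})

# upsample to every integer step; the new rows are filled with NaN
upsampled = (df.set_index('A')
               .reindex(np.arange(df['A'].min(), df['A'].max() + 1))
               .reset_index())

# fill the NaN rows by linear interpolation
upsampled[['B', 'C']] = upsampled[['B', 'C']].interpolate()
print(upsampled)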
I have two dataframes:
df1 - a pivot table that has totals for both columns and rows, both under the default name "All"
df2 - a df I created manually, specifying values and using the same index and column names as in the pivot table above. This table does not have totals.
I need to multiply the first dataframe by the values in the second. I expect the totals to return NaNs since totals don't exist in the second table.
When I perform multiplication, I get the following error:
ValueError: cannot join with no level specified and no overlapping names
When I try the same on dummy dataframes it works as expected:
import pandas as pd
import numpy as np
table1 = np.matrix([[10, 20, 30, 60],
[50, 60, 70, 180],
[90, 10, 10, 110],
[150, 90, 110, 350]])
df1 = pd.DataFrame(data = table1, index = ['One','Two','Three', 'All'], columns =['A', 'B','C', 'All'] )
print(df1)
table2 = np.matrix([[1.0, 2.0, 3.0],
[5.0, 6.0, 7.0],
[2.0, 1.0, 5.0]])
df2 = pd.DataFrame(data = table2, index = ['One','Two','Three'], columns =['A', 'B','C'] )
print(df2)
df3 = df1*df2
print(df3)
This gives me the following output:
A B C All
One 10 20 30 60
Two 50 60 70 180
Three 90 10 10 110
All 150 90 110 350
A B C
One 1.00 2.00 3.00
Two 5.00 6.00 7.00
Three 2.00 1.00 5.00
A All B C
All nan nan nan nan
One 10.00 nan 40.00 90.00
Three 180.00 nan 10.00 50.00
Two 250.00 nan 360.00 490.00
So, visually, the only difference between df1 and df2 is the presence/absence of the column and row "All".
And I think the only difference between my dummy dataframes and the real ones is that the real df1 was created with pd.pivot_table method:
df1_real = pd.pivot_table(PY, values = ['Annual Pay'], index = ['PAR Rating'],
columns = ['CR Range'], aggfunc = [np.sum], margins = True)
I do need to keep the total as I'm using them in other calculations.
I'm sure there is a workaround, but I just really want to understand why the same code works on some dataframes of different sizes but not others. Or maybe the issue is something completely different.
Thank you for reading. I realize it's a very long post.
IIUC,
My Preferred Approach
You can use the mul method in order to pass the fill_value argument. In this case, you'll want a value of 1 (the multiplicative identity) to preserve the value from the dataframe in which the value is not missing.
df1.mul(df2, fill_value=1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
Alternate Approach
You can also embrace the np.nan and use a follow-up combine_first to fill back in the missing bits from df1
(df1 * df2).combine_first(df1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
I really like Pir's approach, and here is mine :-)
df1.loc[df2.index, df2.columns] *= df2
df1
Out[293]:
A B C All
One 10.0 40.0 90.0 60
Two 250.0 360.0 490.0 180
Three 180.0 10.0 50.0 110
All 150.0 90.0 110.0 350
@Wen, @piRSquared, thank you for your help. This is what I ended up doing. There is probably a more elegant solution, but this worked for me.
Since I was able to multiply two dummy dataframes of different sizes, I reasoned the issue wasn't the size, but the fact that one of the dataframes was created as a pivot table. Somehow in this pivot table, the headers were not recognized, though visually they were there. So, I decided to convert the pivot table to a regular dataframe. Steps I took:
Converted the pivot table to records and then back to a dataframe, using the solution from this thread: pandas pivot table to data frame.
Cleaned up the column headers, using the solution from the same thread above: pandas pivot table to data frame.
Set my first column as the index, following the suggestion in this thread: How to remove index from a created Dataframe in Python?
This gave me a dataframe that was visually identical to what I had before but was no longer a pivot table.
I was then able to multiply the two dataframes with no issues. I used the approach suggested by @Wen because I like that it preserves the structure.
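For reference, a minimal sketch of that kind of conversion, assuming df1_real was built with the pivot_table call shown above (which returns MultiIndex columns because values and aggfunc are passed as lists), and using df2_real as a hypothetical name for the manually built dataframe:
# the pivot table's columns look like ('sum', 'Annual Pay', <CR Range>);
# keep only the last level so the headers become plain CR Range labels
flat = df1_real.copy()
flat.columns = flat.columns.get_level_values(-1)
# flat now has plain columns and a plain 'PAR Rating' index, so it aligns with
# an ordinary dataframe; df2_real is a placeholder for the hand-built table
result = flat.mul(df2_real, fill_value=1)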
This is a demo example of my DataFrame. The full DataFrame has multiple additional variables and covers 6 months of data.
sentiment date
1 2015-05-26 18:58:44
0.9 2015-05-26 19:57:31
0.7 2015-05-26 18:58:24
0.4 2015-05-27 19:17:34
0.6 2015-05-27 18:46:12
0.5 2015-05-27 13:32:24
1 2015-05-28 19:27:31
0.7 2015-05-28 18:58:44
0.2 2015-05-28 19:47:34
I want to group the DataFrame by just the day of the date column, but at the same time aggregate the median of the sentiment column.
Everything I have tried with groupby, the dt accessor and timegrouper has failed.
I want to return a pandas DataFrame not a GroupBy object.
The date column's dtype is M8[ns] (datetime64[ns]).
The sentiment column is float64.
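For reference, a minimal sketch that rebuilds the demo frame above (values copied from the table), so the answers below can be run directly:
import pandas as pd

df = pd.DataFrame({
    'sentiment': [1, 0.9, 0.7, 0.4, 0.6, 0.5, 1, 0.7, 0.2],
    'date': ['2015-05-26 18:58:44', '2015-05-26 19:57:31', '2015-05-26 18:58:24',
             '2015-05-27 19:17:34', '2015-05-27 18:46:12', '2015-05-27 13:32:24',
             '2015-05-28 19:27:31', '2015-05-28 18:58:44', '2015-05-28 19:47:34'],
})
df['date'] = pd.to_datetime(df['date'])  # datetime64[ns], i.e. the M8[ns] dtype above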
You fortunately have the tools you need listed in your question.
In [61]: df.groupby(df.date.dt.date)[['sentiment']].median()
Out[61]:
sentiment
2015-05-26 0.9
2015-05-27 0.5
2015-05-28 0.7
I would do this :
import numpy as np

df['date'] = df['date'].apply(lambda x: x.date())
df = df.groupby('date').agg({'sentiment': np.median}).reset_index()
You first replace the datetime column with the date.
Then you perform the groupby+agg operation.
I would do this, because you can do multiple aggregations (like median, mean, min, max, etc.) on multiple columns at the same time:
df.groupby(df.date.dt.date).agg({'sentiment': ['median']})
You can get any number of metrics using one groupby and the .agg() function:
1) Create a new column extracting the date.
2) Use groupby and apply numpy.median, numpy.mean, etc.
import numpy as np
import pandas as pd

x = [[1, '2015-05-26 18:58:44'],
     [0.9, '2015-05-26 19:57:31']]
t = pd.DataFrame(x, columns=['a', 'b'])
t.b = pd.to_datetime(t['b'])
t['datex'] = t['b'].dt.date
t.groupby(['datex']).agg({
    'a': np.median
})
Output -
datex
2015-05-26 0.95