Use Pandas GroupBy Columns in new DataFrame - python

I have a large temperature time series that I'm performing some functions on. I'm taking hourly observations and creating daily statistics. After I'm done with my calculations, I want to take the grouped Year and Julian day keys from the GroupBy object ('aa' below), together with the drangeT and drangeHI arrays that come out, and make an entirely new DataFrame with those variables. Code is below:
import numpy as np
import scipy.stats as st
import pandas as pd
city = ['BUF']#,'PIT','CIN','CHI','STL','MSP','DET']
mons = np.arange(5,11,1)
for a in city:
    data = 'H:/Classwork/GEOG612/Project/'+a+'Data_cut.txt'
    df = pd.read_table(data, sep='\t')
    df['TempF'] = ((9./5.)*df['TempC'])+32.
    df1 = df.loc[df['Month'].isin(mons)]
    aa = df1.groupby(['Year','Julian'], as_index=False)
    maxT = aa.aggregate({'TempF':np.max})
    minT = aa.aggregate({'TempF':np.min})
    maxHI = aa.aggregate({'HeatIndex':np.max})
    minHI = aa.aggregate({'HeatIndex':np.min})
    drangeT = maxT - minT
    drangeHI = maxHI - minHI
    df2 = pd.DataFrame(data={'Year':aa.Year,'Day':aa.Julian,'TRange':drangeT,'HIRange':drangeHI})
All variables in the df2 command are of length 8250, but I get this error message when I run it:
ValueError: cannot copy sequence with size 3 to array axis with dimension 8250
Any suggestions are welcomed and appreciated. Thanks!
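One likely culprit: aa.Year and aa.Julian are SeriesGroupBy objects rather than arrays of group keys, so pd.DataFrame cannot line them up with the 8250-row results. A sketch of a workaround, assuming the column names above (stats is an illustrative name; named aggregation requires pandas >= 0.25): aggregate all four statistics in a single agg call so the Year and Julian keys come back as ordinary columns, then build df2 from that result.
stats = df1.groupby(['Year', 'Julian'], as_index=False).agg(
    maxT=('TempF', 'max'), minT=('TempF', 'min'),
    maxHI=('HeatIndex', 'max'), minHI=('HeatIndex', 'min'))
df2 = pd.DataFrame({'Year': stats['Year'],
                    'Day': stats['Julian'],
                    'TRange': stats['maxT'] - stats['minT'],
                    'HIRange': stats['maxHI'] - stats['minHI']})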

Related

How can I merge the data of two columns within the same DataFrame?

Here is a pic of df1 (fatalities):
So, in order to create a diagram that displays the years with the most injuries (I have an assignment about plane crash incidents in Greece from 2000-2020), I need to create a column out of the minor_injuries and serious_injuries ones.
So I had a first df with more data, but I tried to keep only the columns that I needed, so we have the fatalities df1, which contains the years, the fatal_injuries, the minor_injuries, the serious_injuries and the total number of incidents per year (all_incidents). What I wish to do is merge the minor and serious injuries into a column named total_injuries or just injuries.
import pandas as pd
pd.set_option('display.max_rows', None)
df = pd.read_csv('all_incidents_cleaned.csv')
df.head()
df['Year'] = pd.to_datetime(df.incident_date).dt.year
fatalities = df.groupby('Year').fatalities.value_counts().unstack().reset_index()
fatalities['all_incidents'] = fatalities[['Θανάσιμος τραυματισμός',
    'Μικρός τραυματισμός', 'Σοβαρός τραυματισμός', 'Χωρίς Τραυματισμό']].sum(axis=1)
df['percentage_deaths_to_all_incidents'] = round((fatalities['Θανάσιμος τραυματισμός']/fatalities['all_incidents'])*100, 1)
df1 = fatalities
fatalities_pd = pd.DataFrame(fatalities)
df1
fatalities_pd.rename(columns={'Θανάσιμος τραυματισμός': 'fatal_injuries',
    'Μικρός τραυματισμός': 'minor_injuries',
    'Σοβαρός τραυματισμός': 'serious_injuries',
    'Χωρίς Τραυματισμό': 'no_injuries'}, inplace=True)
df1
For your current dataset, two steps are needed.
First, I would replace the NaN values with 0.
This could be done with (note the assignment; fillna does not modify df1 in place by default):
df1 = df1.fillna(0)
Then you can create a new column "total_injuries" with the sum of minor and serious injuries:
df1["total_injuries"]=df1["minor_injuries"]+df1["serious_injuries"]
It's always good to check your data for consistency before working on it. Helpful commands look like:
data.shape
data.info()
data.isna().values.any()
data.duplicated().values.any()
duplicated_rows = data[data.duplicated()]
len(duplicated_rows)
data.describe()

Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

I'm trying to loop through a series of tickers cleaning the associated dataframes then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code enables me to loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr+'Col1', tkr+'Col2']
    return f2
tkrs = ['tkr1', 'tkr2', 'tkr3']
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
When not in a loop looking to append each successive ticker to a larger dataframe (as per above), the following code does what I'd like. But it doesn't work when looping and adding new ticker data for each successive loop (or I don't know how to make it work in the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.
You should only create the master DataFrame after the loop. Appending to the master DataFrame in each iteration via pandas.concat is slow, since you are creating a new DataFrame every time.
Instead, read each ticker DataFrame, clean it, and append it to a list that stores every ticker's DataFrame. After the loop, create the master DataFrame from all the DataFrames using pandas.concat:
import pandas as pd
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr+'Col1', tkr+'Col2']
    return f2
tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
As a suggestion, here is a cleaner way of defining your clean_func using DataFrame.set_index and DataFrame.add_prefix:
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f2 = f1.set_index('Date')[['Col1', 'Col2']].add_prefix(tkr)
    return f2
Or, if you want, you can parse the Date column as datetime and set it as the index directly in the pd.read_csv call by specifying the index_col and parse_dates parameters (the two work well together):
import pandas as pd
def clean_func(tkr, f1):
    f2 = f1[['Col1', 'Col2']].add_prefix(tkr)
    return f2
tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
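As a quick check of the date alignment the question asks for: pd.concat with axis=1 outer-joins on the index by default, so dates missing from one ticker come through as NaN. A sketch with made-up frames:
import pandas as pd
a = pd.DataFrame({'tkr1Col1': [1, 2, 3]},
                 index=pd.to_datetime(['2022-06-25', '2022-06-26', '2022-06-27']))
b = pd.DataFrame({'tkr2Col1': [4, 5]},
                 index=pd.to_datetime(['2022-06-26', '2022-06-27']))
print(pd.concat([a, b], axis=1))  # tkr2Col1 is NaN on 2022-06-25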
Before the loop create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
Make sure your parentheses are correct per above.
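For illustration, the error appears when the frames are passed as separate positional arguments, so the second frame lands in the axis slot (the exact message can vary by pandas version):
combined = pd.concat((combined, df2), axis=1)    # correct: one tuple of frames
# combined = pd.concat(combined, df2, axis=1)    # raises the TypeError above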
With the code above, you can skip the original step:
df2 = clean_func(tkr,df1)
since it is embedded in the concat call. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined,df2), axis=1)
Just make sure the dataframes are encapsulated by parentheses within the concat function.
Same answer as GC123, but here is a full example that mimics reading from separate files and concatenating them:
import pandas as pd
import io
fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1,fake_file_2]
combined = pd.DataFrame()
for fake_file in fake_files:
    df = pd.read_csv(fake_file)
    df = df.set_index('fruit')
    combined = pd.concat((combined, df), axis=1)
print(combined)
Output: the same combined table as shown below.
This method is slightly more efficient:
combined = []
for fake_file in fake_files:
    combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
                store  quantity  unit_price            store  quantity  unit_price
fruit
apple   fancy-grocers         2        9.25  bargain-grocers       170        0.15
pear    fancy-grocers         3      100.00  bargain-grocers       281        0.45
banana  fancy-grocers         1      256.00  bargain-grocers       667        0.01

How to find Date of 52 Week High and date of 52 Week low using pandas dataframe (Python)?

Please refer to the table below for reference.
I was able to find 52 Week High and low using:
df = pd.read_csv(csv_file_name, engine='python')
df['52W H'] = df['HIGH'].rolling(window=252, center=False).max()
df['52W L'] = df['LOW'].rolling(window=252, center=False).min()
Can someone please guide me on how to find the date of the 52 week high and the date of the 52 week low? Thanks in advance.
My guess is that the date is another column in the dataframe, assuming its name is 'Date'.
You can try something like:
df = pd.read_csv(csv_file_name, engine='python')
df['52W H'] = df['HIGH'].rolling(window=252, center=False).max()
df['52W L'] = df['LOW'].rolling(window=252, center=False).min()
df_low = df[df['LOW'] == df['52W L']]
low_date = df_low['Date']
Similarly, you can look for the high values.
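A symmetric sketch, assuming the same column names:
df_high = df[df['HIGH'] == df['52W H']]
high_date = df_high['Date']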
Also, it would have helped if you had shared your sample dataframe structure.
This answer uses 'pandas_datareader' data. The index is reset first. Then, using the idxmax() and idxmin() functions, the positional indices of the rolling highs and lows are found and arrays are created from these values. The 'Date' column is set as the index again, and the index arrays are used to look the dates up in df.index. Note that when assigning the dates, the leading NaN values (from windows that are not yet full) are skipped.
Replace 'High' and 'Low' with your own column names in df.
import pandas as pd
import pandas_datareader.data as web
import numpy as np
df = web.DataReader('GE', 'yahoo', start='2012-01-10', end='2019-10-09')
df = df.reset_index()
# Positional index of the max/min within each 252-day rolling window
imax = df['High'].rolling(window=252, center=False).apply(lambda x: x.idxmax()).values
imin = df['Low'].rolling(window=252, center=False).apply(lambda x: x.idxmin()).values
# The first incomplete windows produce NaN; count and drop them
count0_imax = np.count_nonzero(np.isnan(imax))
count0_imin = np.count_nonzero(np.isnan(imin))
imax = imax[count0_imax:].astype(int)
imin = imin[count0_imin:].astype(int)
df = df.set_index('Date')
# Map the positional indices back to dates
df.loc[df.index[count0_imax]:, '52W H'] = df.index[imax]
df.loc[df.index[count0_imin]:, '52W L'] = df.index[imin]

Sample Data Script Inquiry?

I'm working on a script that reads data from an Excel file and produces a sample whose size comes from a sample-size calculation. I'm trying to make sure the formula distributes the sample evenly between all unique categories, but I'm not entirely sure where to start.
import pandas as pd
import random
df = pd.read_excel("C:/Users/bryanmccormack/Desktop/Test_Catalog.xlsx")
df2 = df.loc[(df['Track Item']=='Y')]
category_total = df2['Category'].nunique()
total_rows = len(df2.axes[0])
ss = (2.58**2)*(.5)*(1-.5)/.04**2
ss2 = 1+(ss-1/total_rows)
ss3 = ss/ss2
ss4 = round(ss3 * 1000)
category = ss4 / category_total
df3 = df2.groupby('Category').apply(lambda x: x.sample(category))
df3 has 3774 items, and the sample formula takes 999 items, but I'm getting this error: "ValueError: Cannot take a larger sample than population when 'replace=False'"
Any idea why my code is wrong?
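A likely cause, as a sketch: DataFrame.sample needs an integer n, and it raises exactly this ValueError whenever a category has fewer rows than the requested sample size. Rounding category to an int and capping the per-category sample at the group size avoids both problems (n_per_cat is an illustrative name, not from the original script):
n_per_cat = int(round(category))  # sample() requires an integer
df3 = df2.groupby('Category').apply(
    lambda x: x.sample(min(len(x), n_per_cat)))  # never request more rows than the group has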

Grouping levels - multiindex in python pandas pivot_table

I've a multiindex dataframe in pandas that looks like this (created using pivot_table):
I need help adding a level above (or below) the Date level showing the day of the week for each date, like this:
I know I can get the day of a date like this:
lt.DATE.dt.strftime('%a')
# lt is a dataframe and DATE is a column in it
Here is the code to reproduce a similar pivot_table:
import pandas as pd
import numpy as np
dlist = pd.date_range('2015-01-01',periods=5)
df = pd.DataFrame(dlist, columns=['DATE'])
df['EC'] = range(7033,7033+len(df))
df['HS'] = np.random.randint(0,9,5)
df['AH'] = np.random.randint(0,9,5)
pv = pd.pivot_table(df, columns=[df.DATE, 'EC'], values=['HS','AH'])
pv = pv.unstack(level=1).unstack(level=0)
I got the solution! Here it is:
import pandas as pd
import numpy as np
dlist = pd.date_range('2015-01-01',periods=5)
df = pd.DataFrame(dlist, columns=['DATE'])
df['EC'] = range(7033,7033+len(df))
df['HS'] = np.random.randint(0,9,5)
df['AH'] = np.random.randint(0,9,5)
df['DAY'] = df.DATE.dt.strftime('%a')
pv = pd.pivot_table(df, columns=[df.DATE.dt.date, df.DAY, 'EC'], values=['HS','AH'])
pv = pv.unstack(level=[1,2]).unstack(level=0)
pv.to_excel('solution.xlsx')
And it produces an output like this:
Pay attention to the unstack function, and set the list of levels that need to be unstacked at each step.
