Pandas DataFrame sorting issues, grouping for no reason? - python
I have one data frame containing stats about an NBA season. I'm simply trying to sort by date, but for some reason it's grouping all games that have the same date and changing the other values for that date to the same values.
import pandas as pd

df = pd.read_csv("gamedata.csv")

# Total points in the game (team points + opponent points)
df["Total"] = df["Tm"] + df["Opp.1"]

# Move the Team column to position 4
teams = df['Team']
df = df.drop(columns=['Team'])
df.insert(loc=4, column='Team', value=teams)

# Encode W/L as 1/0
df["W/L"] = df["W/L"] == "W"
df["W/L"] = df["W/L"].astype(int)

df = df.sort_values("Date")
df.to_csv("gamedata_clean.csv")
Before / After screenshots omitted.
I expected the df to be unchanged except for the rows being in ascending date order, but it's changing values in other columns for reasons I do not know.
Please add this line to your code to sort your dataframe by date (note that sort_values returns a new DataFrame, so assign the result back):

df = df.sort_values(by='Date')

I hope you will get the desired output.
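If the Date column is read in as plain strings, the sort is lexicographic rather than chronological, which can put rows in an unexpected order. A minimal sketch, assuming the dates in the CSV parse cleanly with pd.to_datetime:

# Parse Date before sorting so the sort is chronological, not alphabetical
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values("Date")
df.to_csv("gamedata_clean.csv")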
Related
Pandas Data frame: Fastest way to shift columns based on a condition
I am trying to find the fastest way to shift a column based on a condition on another column. (The given input and expected output were shown as screenshots, omitted here.) Any help is greatly appreciated. I have tried running through each unique name using a for loop, performing shift(-1) on the data frame, and appending the data to a new data frame. Code example below. But there are over 1M rows and this takes a lot of time to compute. Assuming df2 is our sorted dataframe:

df3 = pd.DataFrame()
for i in df2['Name'].unique():
    if df3.size == 0:
        df3 = df2.loc[df2['Name'] == i]
        df3['final'] = df3['value'].shift(-1)
        df3.fillna('Exit', inplace=True)
    else:
        df4 = df2.loc[df2['Name'] == i]
        df4['final'] = df4['value'].shift(-1)
        df4.fillna('Exit', inplace=True)
        df3 = df3.append(df4, ignore_index=True)
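For this kind of per-group shift, the loop and append can usually be replaced by a single vectorized groupby/shift. A minimal sketch, assuming the grouping column is 'Name' and the value column is 'value' as in the loop above:

# For each Name, take the next row's value; the last row of each group gets 'Exit'
df2['final'] = df2.groupby('Name')['value'].shift(-1)
df2['final'] = df2['final'].fillna('Exit')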
More efficient way to group by and count values Pandas dataframe
Is there a more efficient way to do this? I have sales records imported from a spreadsheet. I start by importing that list into a dataframe. I then need to get the average orders per customer by month and year. The spreadsheet does not contain counts, just order and customer IDs, so I have to count each ID, then drop duplicates, then reset the index. The final dataframe is exported back into a spreadsheet and a SQL database. The code below works, and I get the desired output, but it seems it should be more efficient. I am new to pandas and Python, so I'm sure I could do this better.

df_customers = df.filter(
    ['Month', 'Year', 'Order_Date', 'Customer_ID', 'Customer_Name',
     'Patient_ID', 'Order_ID'], axis=1)
df_order_count = df.filter(['Month', 'Year'], axis=1)
df_order_count['Order_cnt'] = df_customers.groupby(['Month', 'Year'])['Order_ID'].transform('nunique')
df_order_count['Customer_cnt'] = df_customers.groupby(['Month', 'Year'])['Customer_ID'].transform('nunique')
df_order_count['Avg'] = (df_order_count['Order_cnt'] / df_order_count['Customer_cnt']).astype(float).round(decimals=2)
df_order_count = df_order_count.drop_duplicates().reset_index(drop=True)
Try this:

g = df.groupby(['Month', 'Year'])
df_order_count['Avg'] = g['Order_ID'].transform('nunique') / g['Customer_ID'].transform('nunique')
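If the per-row transform followed by drop_duplicates is not otherwise needed, the same summary can be built with a single grouped aggregation. A sketch, assuming the same column names as above:

# One row per (Month, Year): count distinct orders and customers, then divide
summary = (df.groupby(['Month', 'Year'], as_index=False)
             .agg(Order_cnt=('Order_ID', 'nunique'),
                  Customer_cnt=('Customer_ID', 'nunique')))
summary['Avg'] = (summary['Order_cnt'] / summary['Customer_cnt']).round(2)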
Creating a table for time series analysis including counts
I have a dataframe on which I would like to perform some analysis. An easy example of what I would like to achieve: given the dataframe

data = ['2017-02-13', '2017-02-13', '2017-02-13', '2017-02-15', '2017-02-16']
df = pd.DataFrame(data=data, columns=['date'])

I would like to create a new dataframe from this. The new dataframe should contain two columns: the entire date span (so it should also include 2017-02-14) and the number of times each date appears in the original data. I managed to construct a dataframe that includes all the dates like so:

dates = pd.to_datetime(df['date'], format="%Y-%m-%d")
dateRange = pd.date_range(start=dates.min(), end=dates.max()).tolist()
df2 = pd.DataFrame(data=dateRange, columns=['datum'])

My question is: how would I add the counts of each date from df to df2? I've been messing around trying to write my own functions but have not managed to achieve it. I assume this is a common task and that I am overcomplicating it...
Try this:

df2['counts'] = df2['datum'].map(pd.to_datetime(df['date']).value_counts()).fillna(0).astype(int)
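A short end-to-end sketch of the same idea, built instead with value_counts plus reindex over the full date range (the result should match, just assembled in one step):

import pandas as pd

data = ['2017-02-13', '2017-02-13', '2017-02-13', '2017-02-15', '2017-02-16']
df = pd.DataFrame(data=data, columns=['date'])

dates = pd.to_datetime(df['date'])
full_range = pd.date_range(dates.min(), dates.max())

# Count occurrences of each date and insert 0 for days missing from the span
counts = dates.value_counts().reindex(full_range, fill_value=0)
df2 = counts.rename_axis('datum').reset_index(name='counts')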
How to solve ValueError: left keys must be sorted when merging two Pandas dataframes?
I'm trying to merge two Pandas dataframes, one called SF1 with quarterly data, and one called DAILY with daily data.

Daily dataframe:

ticker,date,lastupdated,ev,evebit,evebitda,marketcap,pb,pe,ps
A,2020-09-14,2020-09-14,31617.1,36.3,26.8,30652.1,6.2,44.4,5.9

SF1 dataframe:

ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,2020-09-14,2020-09-14,2020-09-14,2020-09-14,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000

I have already sorted the dataframes by date and by ticker.
Data sorting code:

daily['date'] = pd.to_datetime(daily['date'])
sf1['calendardate'] = pd.to_datetime(sf1['calendardate'])
daily = daily.sort_values(['ticker', 'date'])
sf1 = sf1.sort_values(['ticker', 'calendardate'])

Sorted DAILY:

,ticker,date,lastupdated,ev,evebit,evebitda,marketcap,pb,pe,ps
180766,AAPL,2007-05-30,2020-08-31,95640.1,24.1,22.6,102735.1,8.4,36.8,4.8
180716,AAPL,2007-05-31,2020-08-31,97722.9,24.7,23.1,104817.9,8.5,37.6,4.9

Sorted SF1:

ticker,calendardate,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pe1,ppnenet,prefdivis,price,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
0,AAPL,2007-06-30,56000000.0,21647000000.0,,18745000000.0,2902000000.0,,0.552,-283000000.0,7118000000.0,7118000000.0,3415000000.0,818000000.0,2.681,0.615,0.0,0.0,0.0,0.0,0.0,81000000.0,0.0,0.0,0.0,1196000000.0,1277000000.0,0.236,1277000000.0,1196000000.0,1196000000.0,0.034,0.033,0.034,13404000000.0,,13404000000.0,944000000.0,0.039,1.0,1995000000.0,0.369,275000000.0,0.0,7262000000.0,,251000000.0,6649000000.0,6649000000.0,0.0,8243000000.0,6992000000.0,1251000000.0,23000000.0,-6000000.0,118000000.0,0.0,0.0,229000000.0,-1433000000.0,-1170000000.0,1227000000.0,0.0,818000000.0,818000000.0,818000000.0,0.0,0.0,0.151,954000000.0,1041000000.0,3660000000.0,0.0,36.815,1626000000.0,0.0,4.7860000000000005,5.134,1410000000.0,8199000000.0,5410000000.0,5410000000.0,208000000.0,,,,,65000000.0,746000000.0,1.0,24349946740.0,24270568000.0,24938788000.0,0.223,21372000000.0,687000000.0,378000000.0,0.0,0.8809999999999999,11753000000.0
48,AAPL,2007-06-30,56000000.0,21647000000.0,19256000000.0,18745000000.0,2902000000.0,1.175,0.552,-675000000.0,7118000000.0,7118000000.0,15150000000.0,3134000000.0,2.681,0.615,0.0,0.0,0.0,0.0,0.0,290000000.0,0.0,0.0,0.0,4499000000.0,4789000000.0,0.212,4789000000.0,4499000000.0,4499000000.0,0.13,0.127,0.13,13404000000.0,11719250000.0,13404000000.0,4154000000.0,0.171,1.0,7476000000.0,0.33,275000000.0,0.0,7262000000.0,5515250000.0,251000000.0,6649000000.0,6649000000.0,0.0,8243000000.0,6992000000.0,1251000000.0,-895000000.0,-222000000.0,325000000.0,0.0,0.0,650000000.0,-6374000000.0,-5492000000.0,4829000000.0,0.0,3134000000.0,3134000000.0,3134000000.0,0.0,0.0,0.139,3519000000.0,3957000000.0,3660000000.0,0.0,36.815,1626000000.0,0.0,4.7860000000000005,5.134,1410000000.0,8199000000.0,22626000000.0,22626000000.0,754000000.0,0.163,0.267,0.816,0.199,214000000.0,2765000000.0,1.0,24349946740.0,24270568000.0,24938788000.0,0.932,21372000000.0,687000000.0,1365000000.0,0.0,0.8809999999999999,11753000000.0

Merging code:

df = pd.merge_asof(daily, sf1, by='ticker', left_on='date', right_on='calendardate')

Error message:

ValueError: left keys must be sorted

I'm not sure why I'm getting the error. It might be due to inconsistencies in when the dataframes' first dates start.
Sorting by ticker is not necessary, as that column is used for the exact ('by') join. Moreover, having it as the first column in your sort_values calls prevents the correct sorting on the columns used for the backward search, namely date and calendardate. Try:

daily = daily.sort_values(['date'])
sf1 = sf1.sort_values(['calendardate'])
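Putting it together with the merge from the question, a minimal sketch (assuming both date columns have already been parsed with pd.to_datetime as above):

# merge_asof needs the left_on/right_on keys globally sorted;
# the by='ticker' key handles the exact per-ticker matching
daily = daily.sort_values('date')
sf1 = sf1.sort_values('calendardate')
df = pd.merge_asof(daily, sf1, by='ticker',
                   left_on='date', right_on='calendardate')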
Pandas colnames not found after grouping and aggregating
Here is my data:

threats = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-18/threats.csv', index_col=0)

And here is my code:

df = (threats
      .query('threatened>0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened':'size'}))

However, df.columns returns only Index(['threatened'], dtype='object'). That is, only the threatened column is showing, not the columns I actually grouped by, i.e. continent and threat_type, although they are present in my data frame. I would like to perform an operation on the continent column of my data frame, but it is not showing as one of the columns. For example, continents = df.continent.unique() gives me a KeyError saying continent is not found.
After a groupby, pandas puts the groupby columns into the index. Reset the index after the groupby, and don't pass drop=True (that would discard the grouping columns). After your code:

df = df.reset_index()

And then you will get the required columns.
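A minimal end-to-end sketch of the same idea, folding the reset into the chain so continent and threat_type come back as regular columns:

import pandas as pd

threats = pd.read_csv(
    'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-18/threats.csv',
    index_col=0)

df = (threats
      .query('threatened>0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened': 'size'})
      .reset_index())

# The grouping keys are now ordinary columns again
continents = df['continent'].unique()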