I have a pandas DataFrame df in which I am trying to find, per state, the sum of hectares that need to be harvested before the threshold day given in another DataFrame lst is reached.
import pandas as pd

lst = pd.DataFrame()
lst['ST'] = ['CA', 'MA', 'TX', 'FL', 'OH', 'WY', 'AK']
lst['doy'] = [140, 150, 160, 170, 180, 190, 200]
print(df)
doy ST ... area left
0 111 AK ... 4.293174e+05 760964.996900
1 120 AK ... 4.722491e+06 760535.679500
2 121 AK ... 8.586347e+06 760149.293900
3 122 AK ... 2.683233e+07 758324.695200
4 122 AK ... 2.962290e+07 758045.638900
.. ... ... ... ... ...
111 211 AK ... 7.609006e+09 107.329336
112 212 AK ... 7.609221e+09 85.863469
113 213 AK ... 7.609435e+09 64.397602
114 214 AK ... 7.609650e+09 42.931735
115 215 AK ... 7.610079e+09 0.000000
So I would end up with a DataFrame that sums up all the area before the threshold doy in lst, per state:
area ST
5.0000e+05 CA
4.0123e+05 MA
3.1941e+05 TX
4.0011e+05 FL
1.2346e+05 OH
87.318e+05 WY
0.7133e+05 AK
How can I achieve this?
You can map the ST column using a Series built from lst, compare the result against the df['doy'] column with less-than, filter with boolean indexing, and aggregate with sum:
# Map each state's threshold doy onto df, keep rows whose doy is below it,
# then sum the area per state
df1 = (df[df['doy'].lt(df['ST'].map(lst.set_index('ST')['doy']))]
         .groupby('ST', as_index=False)['area'].sum()[['area', 'ST']])
print(df1)
area ST
0 70193385.4 AK
If I understood you correctly, you should filter df by doy, then group by ST and sum.
Here is an example with doy before 108:
doy_threshold = 108
# Keep rows before the threshold, then sum the area per state
df[df['doy'] < doy_threshold].groupby(by=['ST'])['area'].sum()
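To apply each state's own threshold from lst instead of a single fixed value, the same idea can be combined with a merge; a minimal sketch, assuming the ST, doy, and area columns from the question:
# Attach each state's threshold doy to every row, then filter and aggregate
merged = df.merge(lst.rename(columns={'doy': 'doy_threshold'}), on='ST')
result = (merged[merged['doy'] < merged['doy_threshold']]
          .groupby('ST', as_index=False)['area'].sum())
print(result)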
Is there a website or a function that creates DataFrame example code so that it can be used in tutorials? Something like this:
df = pd.DataFrame({'age': [ 3, 29],
'height': [94, 170],
'weight': [31, 115]})
or
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
or
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
You can get over 750 datasets from pydataset:
pip install pydataset
To see a list of the datasets:
from pydataset import data
# To see a list of the datasets
print(data())
Output:
dataset_id title
0 AirPassengers Monthly Airline Passenger Numbers 1949-1960
1 BJsales Sales Data with Leading Indicator
2 BOD Biochemical Oxygen Demand
3 Formaldehyde Determination of Formaldehyde
4 HairEyeColor Hair and Eye Color of Statistics Students
.. ... ...
752 VerbAgg Verbal Aggression item responses
753 cake Breakage Angle of Chocolate Cakes
754 cbpp Contagious bovine pleuropneumonia
755 grouseticks Data on red grouse ticks from Elston et al. 2001
756 sleepstudy Reaction times in a sleep deprivation study
[757 rows x 2 columns]
Usage
To load one of the example datasets into a DataFrame, it is as simple as using its dataset_id:
from pydataset import data
df = data('cake')
print(df)
Output:
replicate recipe temperature angle temp
1 1 A 175 42 175
2 1 A 185 46 185
3 1 A 195 47 195
4 1 A 205 39 205
5 1 A 215 53 215
.. ... ... ... ... ...
266 15 C 185 28 185
267 15 C 195 25 195
268 15 C 205 25 205
269 15 C 215 31 215
270 15 C 225 25 225
[270 rows x 5 columns]
Note: there are other packages with their own functionality, or you can create your own.
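If you do roll your own, a tiny helper that fabricates a random example DataFrame could look like this (purely illustrative; make_example_df and its columns are invented for this sketch):
import numpy as np
import pandas as pd

def make_example_df(n_rows=5, seed=0):
    # Build a small random DataFrame for use in tutorials
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'name': [f'person_{i}' for i in range(n_rows)],
        'age': rng.integers(18, 65, size=n_rows),
        'score': rng.random(n_rows).round(2),
    })

print(make_example_df())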
You can get over 17000 datasets from the datasets package:
pip install datasets
To list all of the datasets:
from datasets import list_datasets
# Print all the available datasets
print(list_datasets())
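To pull one of those datasets into a pandas DataFrame, something along these lines should work (a sketch that assumes network access; 'imdb' is just one example dataset id):
from datasets import load_dataset

# Download one split of an example dataset and convert it to pandas
ds = load_dataset('imdb', split='train')
df = ds.to_pandas()
print(df.head())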
I would like to know how I can transform the day columns into week columns.
I tried groupby.sum(), but there is no column name pattern, so I don't know what to group by.
The result should have column names like 'weekX': week1 (sum of the first 7 days), week2, week3, and so on.
Thanks in advance.
You can try:
# Bucket the day columns (everything after the first 4 columns) into groups of 7
idx = pd.RangeIndex(len(df.columns[4:])) // 7
out = df.iloc[:, 4:].groupby(idx, axis=1).sum().rename(columns=lambda x: f'Week{x+1}')
# Reattach the leading identifier columns
out = pd.concat([df.iloc[:, :4], out], axis=1)
print(out)
# Output
    Province/State         Country/Region        Lat  ...  Week26  Week27  Week28
0              NaN            Afghanistan   33.93911  ...  247210  252460  219855
1              NaN                Albania    41.1533  ...   28068   32671   32113
2              NaN                Algeria    28.0339  ...  157675  187224  183841
3              NaN                Andorra    42.5063  ...    6147    6283    5552
4              NaN                 Angola   -11.2027  ...    4741    6341    6978
..             ...                    ...        ...  ...     ...     ...     ...
261            NaN  Sao Tome and Principe     0.1864  ...    5199    5813    5231
262            NaN                  Yemen  15.552727  ...   11089   11717   10363
263            NaN                Comoros   -11.6455  ...    2310    2419    2292
264            NaN             Tajikistan     38.861  ...   47822   50032   44579
265            NaN                Lesotho     -29.61  ...    2259    3011    3922
[266 rows x 32 columns]
You can use the melt method to combine all your date columns into a single 'Date' column:
df = df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='Date', value_name='Value')
From this point it should be straightforward to group the 'Date' column by week, and then unstack it if you want to have it as multiple columns again.
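A sketch of that follow-up step, assuming the melted 'Date' values parse with pd.to_datetime (note that ISO week numbers restart each year, so data spanning multiple years would also need a year key):
# Parse dates, derive a week number, then aggregate and pivot back to wide form
df['Date'] = pd.to_datetime(df['Date'])
df['Week'] = df['Date'].dt.isocalendar().week
weekly = (df.groupby(['Country/Region', 'Week'])['Value'].sum()
            .unstack('Week')
            .add_prefix('Week'))
print(weekly.head())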
The following are the first couple of columns of a DataFrame, and I want to calculate V1_x - V1_y, V2_x - V2_y, V3_x - V3_y, etc. The variable names to difference differ only by the last character (either x or y).
import pandas as pd
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Address': ['xx', 'yy', 'zz','ww'], 'V1_x': [20, 21, 19, 18], 'V2_x': [233, 142, 643, 254], 'V3_x': [343, 543, 254, 543], 'V1_y': [20, 21, 19, 18], 'V2_y': [233, 142, 643, 254], 'V3_y': [343, 543, 254, 543]}
df = pd.DataFrame(data)
df
Name Address V1_x V2_x V3_x V1_y V2_y V3_y
0 Tom xx 20 233 343 20 233 343
1 Joseph yy 21 142 543 21 142 543
2 Krish zz 19 643 254 19 643 254
3 John ww 18 254 543 18 254 543
I currently do the calculation by manually defining the column names:
new_df = pd.DataFrame()
new_df['Name'] = df['Name']
new_df['Address'] = df['Address']
new_df['Col1'] = df['V1_x'] - df['V1_y']
new_df['Col2'] = df['V2_x'] - df['V2_y']
new_df['Col3'] = df['V3_x'] - df['V3_y']
Is there an approach I can use to check whether column names differ only by the last character and, if so, compute their differences automatically?
Try creating a MultiIndex header using .str.split, then reshape the DataFrame and use pd.DataFrame.eval for the calculation, then reshape back to the original form with the additional columns. Lastly, flatten the MultiIndex header using a list comprehension with f-string formatting:
dfi = df.set_index(['Name', 'Address'])
# Split 'V1_x' into ('V1', 'x') to create a column MultiIndex
dfi.columns = dfi.columns.str.split('_', expand=True)
# Stack the first level, compute diff = x - y, then unstack back
dfs = dfi.stack(0).eval('diff=x-y').unstack()
# Flatten ('V1', 'x') back into 'V1_x'
dfs.columns = [f'{j}_{i}' for i, j in dfs.columns]
dfs
Output:
V1_x V2_x V3_x V1_y V2_y V3_y V1_diff V2_diff V3_diff
Name Address
John ww 18 254 543 18 254 543 0 0 0
Joseph yy 21 142 543 21 142 543 0 0 0
Krish zz 19 643 254 19 643 254 0 0 0
Tom xx 20 233 343 20 233 343 0 0 0
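If you only want the differenced columns in the final frame, a small follow-up sketch on top of dfs above is to filter them and reset the index:
# Keep just the *_diff columns and restore Name/Address as regular columns
new_df = dfs.filter(like='_diff').reset_index()
print(new_df)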
I have a DataFrame created from the dictionary below:
d = {
'Region':[
'north','north','north','north','south',
'south','south','east','east','east',
'east','west','west','west'
],
'Store No':[ 1,2,3,4,5,6,7,8,9,10,11,12,13,14],
'Sales':[196, 193, 176, 168, 165, 163, 166, 135, 151, 108, 119, 176, 132, 107]
}
1) How do I create another DataFrame that extracts the top 3 stores (by the "Sales" column) for each region?
2) Assuming the "Region" column had many more distinct values (such as Northeast, Northwest, Southwest, etc.), how do I create another DataFrame that extracts the regions that start with "North"?
You can use the groupby and nlargest functions.
1) Top 3 sales per region:
You can create a dictionary of DataFrames, one per region, each holding that region's top 3 sales:
In [687]: top_3_sales = df.groupby('Region')['Sales'].nlargest(3).reset_index().rename(columns={'level_1': 'Store No'})
In [688]: list_of_regions = df.Region.unique().tolist()
In [691]: dict_of_region_df = {region: top_3_sales.loc[top_3_sales['Region'] == region] for region in list_of_regions}
Then query your dict to have individual dataframes:
In [693]: dict_of_region_df['north']
Out[693]:
Region Store No Sales
3 north 0 196
4 north 1 193
5 north 2 176
In [694]: dict_of_region_df['east']
Out[694]:
Region Store No Sales
0 east 8 151
1 east 7 135
2 east 10 119
2) Regions starting with 'north':
In [681]: df[df.Region.str.startswith('north')]
Out[681]:
Region Store No Sales
0 north 1 196
1 north 2 193
2 north 3 176
3 north 4 168
For question 1, use the nlargest function on the DataFrame (built from the dictionary with df = pd.DataFrame(d)):
In [13]: df_1 = df.groupby('Region')['Sales'].nlargest(3)
In [14]: df_1
Out[14]:
Region
east 8 151
7 135
10 119
north 0 196
1 193
2 176
south 6 166
4 165
5 163
west 11 176
12 132
13 107
Name: Sales, dtype: int64
For the second question, you can use startswith to find the regions starting with north.
In [11]: df_2 = df[df['Region'].str.startswith('north')]
In [12]: df_2
Out[12]:
Region Store No Sales
0 north 1 196
1 north 2 193
2 north 3 176
3 north 4 168
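An alternative to nlargest that keeps the original flat row shape is to sort first and take the head of each group; a quick sketch, again with df = pd.DataFrame(d):
# Sort by Sales descending, then keep the first 3 rows of each region
top3 = (df.sort_values('Sales', ascending=False)
          .groupby('Region')
          .head(3))
print(top3)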
How do I filter pivot tables to return specific columns? Currently my DataFrame is this:
print(table)
sum
Sex Female Male All
Date (Intervals)
April 166 191 357
August 212 263 475
December 173 263 436
February 192 298 490
January 148 195 343
July 189 260 449
June 165 238 403
March 165 278 443
May 236 253 489
November 167 247 414
October 185 287 472
September 175 306 481
All 2173 3079 5252
I want to display results of only the male column. I tried the following code:
table.query('Sex == "Male"')
However I got this error
TypeError: Expected tuple, got str
How would I be able to filter my table by specified rows or columns?
It looks like table has a column MultiIndex:
sum
Sex Female Male All
One way to check if your table has a column MultiIndex is to inspect table.columns:
In [178]: table.columns
Out[178]:
MultiIndex(levels=[['sum'], ['All', 'Female', 'Male']],
labels=[[0, 0, 0], [1, 2, 0]],
names=[None, 'sex'])
To access a column of table you need to specify a value for each level of the MultiIndex:
In [179]: list(table.columns)
Out[179]: [('sum', 'Female'), ('sum', 'Male'), ('sum', 'All')]
Thus, to select the Male column, you would use
In [176]: table[('sum', 'Male')]
Out[176]:
date
April 42.0
August 34.0
December 32.0
...
Since the sum level is unnecessary, you could get rid of it by specifying the values parameter when calling df.pivot or df.pivot_table.
table2 = df.pivot_table(index='date', columns='sex', aggfunc='sum', margins=True,
values='sum')
# sex Female Male All
# date
# April 40.0 40.0 80.0
# August 48.0 32.0 80.0
# December 48.0 44.0 92.0
For example,
import numpy as np
import pandas as pd
import calendar
np.random.seed(2016)
N = 1000
sex = np.random.choice(['Male', 'Female'], size=N)
date = np.random.choice(calendar.month_name[1:13], size=N)
df = pd.DataFrame({'sex':sex, 'date':date, 'sum':1})
# This reproduces a table similar to yours
table = df.pivot_table(index='date', columns='sex', aggfunc='sum', margins=True)
print(table[('sum', 'Male')])
# table2 has a single level Index
table2 = df.pivot_table(index='date', columns='sex', aggfunc='sum', margins=True,
values='sum')
print(table2['Male'])
Another way to remove the sum level would be to use table = table['sum'],
or table.columns = table.columns.droplevel(0).
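Both alternatives in one short sketch, continuing from the table built above:
# Option 1: select the top level, leaving a single-level 'sex' column index
print(table['sum']['Male'])

# Option 2: drop the top level in place
table.columns = table.columns.droplevel(0)
print(table['Male'])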