In Python, I have a dataframe similar to this:
print(transport)
year cycles cars lorries
1993 249 21000 1507
1994 249 21438 1539
1995 257 21817 1581
1996 253 22364 1630
1997 253 22729 1668
I would like to look at the change over time for each form of transport, relative to the earliest value, i.e. the one for 1993. I would like to create a new dataframe based on transport where 1993 is the base year with its value set to 100.0 and all subsequent values are expressed relative to it.
I'm really new to Python and I can't figure out how to approach this.
You can divide all the values by the corresponding number from 1993 and multiply by 100.0 to get the results -
df = df.set_index('year')
df / df.loc[1993] * 100.0
# cycles cars lorries
# year
# 1993 100.000000 100.000000 100.000000
# 1994 100.000000 102.085714 102.123424
# 1995 103.212851 103.890476 104.910418
# 1996 101.606426 106.495238 108.161911
# 1997 101.606426 108.233333 110.683477
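For reference, here is a self-contained sketch of the same rebasing on the sample data from the question. The name `indexed` is only illustrative; `DataFrame.div` with `axis="columns"` makes the column alignment against the base-year row explicit.

```python
import pandas as pd

transport = pd.DataFrame({
    "year": [1993, 1994, 1995, 1996, 1997],
    "cycles": [249, 249, 257, 253, 253],
    "cars": [21000, 21438, 21817, 22364, 22729],
    "lorries": [1507, 1539, 1581, 1630, 1668],
}).set_index("year")

# Divide every row by the base-year row (aligned on column names), then scale to 100.
base = transport.loc[1993]
indexed = transport.div(base, axis="columns").mul(100.0)

print(indexed.round(2))
```

Because the result is a new dataframe, `transport` itself keeps the raw counts.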
You can use iterrows:
transport.set_index("year", inplace=True)
y1993 = transport[transport.index == 1993]
for index, row in transport.iterrows():
    for column in transport.columns:
        transport.loc[index, column] = ((row[column])/(y1993[column].values[0])) * 100
transport
Output
year  cycles   cars     lorries
1993  100      100      100
1994  100      102.086  102.123
1995  103.213  103.89   104.91
1996  101.606  106.495  108.162
1997  101.606  108.233  110.683
I started with pandas and NumPy a couple of months ago and have already learned quite a lot thanks to all the threads here. But now I can't find what I need.
For work, I have created an Excel sheet that calculates some figures used for re-ordering inventory. To practice, and maybe actually use it, I wanted to try replicating the functionality in Python. Later I might want to add some more sophisticated calculations with the help of scikit-learn.
So far I've managed to load a csv with sales figures from our ERP into a dataframe and calculate the mean and std. The calculations have been done on a subset of the data because I don't know how to apply calculations only to specific columns. The csv also contains, for example, product codes and lead times, and these should not be used for the average and std calculations. I'm also not sure yet how to merge this subset back into the original dataframe.
The reason I didn't hardcode the column names is that the ERP reports the sales numbers over the past x months, so the order of the columns will change throughout the year and I want to keep them in chronological order.
My data from the csv looks like:
"code","leadtime","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"
"001.002",60,299,821,351,614,246,957,968,939,125,368,727,231
"001.002",25,340,274,733,575,904,953,614,268,638,960,617,757
"001.002",130,394,327,435,767,377,699,424,951,972,717,317,264
What I've done so far, which works fine (it can probably be done much more easily and efficiently):
import numpy as np
import timeit
import csv
import pandas as pd
sd = 1
csv_in = "data_in.csv"
csv_out = "data_out.csv"
# Use Pandas
df = pd.read_csv(csv_in,dtype={'code': str})
# Get no of columns and subtract 2 for compcode and leadtime
cols = df.shape[1] - 2
# Create a subset and count the columns
df_subset = df.iloc[:, -cols:]
subset_cols = df_subset.shape[1]
# Add columns for std dev and average
df_subset = (df_subset.assign(mean=df_subset.mean(axis=1),
                              stddev=df_subset.std(axis=1, ddof=0))
             )
# Add columns for min and max values based on mean +/- std multiplied by factor sd
df_subset = (df_subset.assign(minSD=df_subset['mean'].sub(df_subset['stddev'] * sd),
                              maxSD=df_subset['mean'].add(df_subset['stddev'] * sd))
             )
df_subset
Which gives me:
jan feb mar apr may jun jul aug sep oct nov dec mean stddev minSD maxSD
0 299 821 351 614 246 957 968 939 125 368 727 231 553.833333 304.262998 249.570335 858.096332
1 340 274 733 575 904 953 614 268 638 960 617 757 636.083333 234.519530 401.563804 870.602863
2 394 327 435 767 377 699 424 951 972 717 317 264 553.666667 242.398203 311.268464 796.064870
However for my next calculation I'm stuck again:
I want to calculate the average over values from the "month" columns and only the values that match the condition >= minSD and <= maxSD
So for row 0, I'm looking for the value (299+821+351+614+368+727)/6 = 530
How can I achieve this?
I've tried this, but this doesn't seem to work:
df_subset = df_subset.assign(avgwithSD=df_subset.iloc[:,0:subset_cols].values(where(df_subset.values>=df_subset['minSD'] & df_subset.values>=df_subset['maxSD'])).mean(axis=1))
Some help would be very welcome. Thanks
EDIT: With help I ended up using this to get further with my program
import numpy as np
import timeit
import csv
import pandas as pd
# sd will determine if range will be SD1 or SD2
sd = 1
# file to use
csv_in = "data_in.csv"
csv_out = "data_out.csv"
# Function to calculate the mean of the values within the range between minSD and maxSD
def CalcMeanSD(row):
    months_ = row[2:14]
    min_SD = row[-2]
    max_SD = row[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]
# Use Pandas
df = pd.read_csv(csv_in,dtype={'code': str})
# Define the month/data columns and set them to floatvalues
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
# Add columns for stddev and mean. Based on these values set new range between minSD and maxSD
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# Add column with the mean of the new range
df['avgwithSD'] = np.nanmean(df.apply(CalcMeanSD, axis=1), axis=1)
df
Result is:
code leadtime jan feb mar apr may jun jul aug sep oct nov dec stddev mean minSD maxSD avgwithSD
0 001.002 60 299.0 821.0 351.0 614.0 246.0 957.0 968.0 939.0 125.0 368.0 727.0 231.0 304.262998 553.833333 249.570335 858.096332 530.000000
1 001.002 25 340.0 274.0 733.0 575.0 904.0 953.0 614.0 268.0 638.0 960.0 617.0 757.0 234.519530 636.083333 401.563804 870.602863 655.666667
2 001.002 130 394.0 327.0 435.0 767.0 377.0 699.0 424.0 951.0 972.0 717.0 317.0 264.0 242.398203 553.666667 311.268464 796.064870 495.222222
3 001.002 90 951.0 251.0 411.0 469.0 359.0 220.0 192.0 250.0 818.0 768.0 937.0 128.0 292.572925 479.500000 186.927075 772.072925 365.000000
4 001.002 35 228.0 400.0 46.0 593.0 61.0 293.0 5.0 203.0 850.0 506.0 37.0 631.0 264.178746 321.083333 56.904588 585.262079 281.833333
5 001.002 10 708.0 804.0 208.0 380.0 531.0 125.0 500.0 773.0 354.0 238.0 805.0 215.0 242.371773 470.083333 227.711560 712.455106 451.833333
6 001.002 14 476.0 628.0 168.0 946.0 29.0 324.0 3.0 400.0 981.0 467.0 459.0 571.0 295.814225 454.333333 158.519109 750.147558 436.625000
7 001.002 14 92.0 906.0 18.0 537.0 57.0 399.0 544.0 977.0 909.0 687.0 881.0 459.0 333.154577 538.833333 205.678756 871.987910 525.200000
8 001.002 90 487.0 634.0 5.0 918.0 158.0 447.0 713.0 459.0 465.0 643.0 482.0 672.0 233.756447 506.916667 273.160220 740.673113 555.777778
9 001.002 130 741.0 43.0 976.0 461.0 35.0 321.0 434.0 8.0 330.0 32.0 896.0 531.0 326.216782 400.666667 74.449885 726.883449 415.400000
EDIT:
Instead of your original code:
# first part:
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# second part: (the one that doesn't work for you)
def calc_mean_per_row_by_condition(row):
    months_ = row[2:14]
    min_SD = row[-2]
    max_SD = row[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]
df['avgwithSD'] = np.nanmean(df.apply(calc_mean_per_row_by_condition, axis=1), axis=1)
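A vectorized alternative to the row-wise apply, as a minimal sketch rather than a definitive implementation: it masks out-of-range months with `DataFrame.where` and lets `mean(axis=1)` skip the resulting NaN values. The data is the first two rows from the question; the names `vals`, `lower`, `upper` and `in_range` are illustrative.

```python
import pandas as pd

sd = 1  # width of the band in standard deviations, as in the question
months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
rows = [
    ["001.002", 60, 299, 821, 351, 614, 246, 957, 968, 939, 125, 368, 727, 231],
    ["001.002", 25, 340, 274, 733, 575, 904, 953, 614, 268, 638, 960, 617, 757],
]
df = pd.DataFrame(rows, columns=["code", "leadtime"] + months)

# Work only on the month columns (everything after code and leadtime).
vals = df[df.columns[2:]].astype("float64")
mean = vals.mean(axis=1)
std = vals.std(axis=1, ddof=0)
lower = mean - std * sd
upper = mean + std * sd

# Row-wise comparison: axis=0 aligns the bounds with the row index.
in_range = vals.ge(lower, axis=0) & vals.le(upper, axis=0)

# Values outside the band become NaN, which mean() skips by default.
df["avgwithSD"] = vals.where(in_range).mean(axis=1)
print(df[["code", "leadtime", "avgwithSD"]])
```

Computing the statistics on a separate `vals` frame also avoids the problem of the helper columns shifting the positional slices.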
I have a large pandas DataFrame df:
year count
1983 5
1983 4
1983 7
...
2009 8
2009 11
2009 30
and I aim to sample 10 data points per year 100 times and get the mean and standard deviation of count per year. The signs of the count values are determined randomly.
I want to randomly sample 10 data per year, which can be done by:
new_df = pd.DataFrame(columns=['year', 'count'])
ref = df.year.unique()
for i in range(len(ref)):
    appended_df = df[df['year'] == ref[i]].sample(n=10)
    new_df = pd.concat([new_df,appended_df])
Then, I assign a sign to count randomly (so that by random chance the count could be positive or negative) and rename it to value, which can be done by:
from random import randint
vlist = []
for i in range(len(new_df)):
    if randint(0,1) == 0:
        vlist.append(new_df['count'].iloc[i])
    else:
        vlist.append(new_df['count'].iloc[i] * -1)
new_df['value'] = vlist
Getting the mean and standard deviation for each year is quite simple:
xdf = new_df.groupby("year").agg([np.mean, np.std]).reset_index()
But I can't seem to find an efficient way to repeat this sampling 100 times per year, store the mean values, and get the mean and standard deviation of those 100 means per year. I could use a for loop, but it would take too long to run.
Essentially, the output should be in the form of the following (the values are arbitrary here):
year mean_of_100_means total_sd
1983 4.22 0.43
1984 -6.39 1.25
1985 2.01 0.04
...
2007 11.92 3.38
2008 -5.27 1.67
2009 1.85 0.99
Any insights would be appreciated.
Try:
def fn(x):
    _100_means = [x.sample(10).mean() for i in range(100)]
    return {
        "mean_of_100_means": np.mean(_100_means),
        "total_sd": np.std(_100_means),
    }
print(df.groupby("year")["count"].apply(fn).unstack().reset_index())
EDIT: Changed the computation of means.
Prints:
year mean_of_100_means total_sd
0 1983 48.986 8.330787
1 1984 48.479 10.384896
2 1985 48.957 7.854900
3 1986 50.821 10.303847
4 1987 50.198 9.835832
5 1988 47.497 8.678749
6 1989 46.763 9.197387
7 1990 49.696 8.837589
8 1991 46.979 8.141969
9 1992 48.555 8.603597
10 1993 50.220 8.263946
11 1994 48.735 9.954741
12 1995 49.759 8.532844
13 1996 49.832 8.998654
14 1997 50.306 9.038316
15 1998 49.513 9.024341
16 1999 50.532 9.883166
17 2000 49.195 9.177008
18 2001 50.731 8.309244
19 2002 48.792 9.680028
20 2003 50.251 9.384759
21 2004 50.522 9.269677
22 2005 48.090 8.964458
23 2006 49.529 8.250701
24 2007 47.192 8.682196
25 2008 50.124 9.337356
26 2009 47.988 8.053438
The dataframe was created:
data = []
for y in range(1983, 2010):
    for i in np.random.randint(0, 100, size=1000):
        data.append({"year": y, "count": i})
df = pd.DataFrame(data)
I think you can use pandas groupby and sample functions together to take 10 samples from each year of your DataFrame. If you put this in a loop, then you can sample it 100 times, and combine the results.
It sounds like you only need the standard deviation of the 100 means (and you don't need the standard deviation of the sample of 10 observations), so you can calculate only the mean in your groupby and sample, then calculate the standard deviation from each of those 100 means when you are creating the total_sd column of your final DataFrame.
import numpy as np
import pandas as pd
np.random.seed(42)
## create a random DataFrame with 100 entries for the years 1980-1999, length 2000
df = pd.DataFrame({
'year':[year for year in list(range(1980, 2000)) for _ in range(100)],
'count':np.random.randint(1,100,size=2000)
})
list_of_means = []
## sample 10 observations from each year, and repeat this process 100 times, storing the mean for each year in a list
for _ in range(100):
    df_sample = df.groupby("year").sample(10).groupby("year").mean()
    list_of_means.append(df_sample['count'].tolist())
array_of_means = [np.array(x) for x in list_of_means]
result = pd.DataFrame({
'year': df.year.unique(),
'mean_of_100_means': [np.mean(k) for k in zip(*array_of_means)],
'total_sd': [np.std(k) for k in zip(*array_of_means)]
})
This results in:
>>> result
year mean_of_100_means total_sd
0 1980 50.316 8.656948
1 1981 48.274 8.647643
2 1982 47.958 8.598455
3 1983 49.357 7.854620
4 1984 48.977 8.523484
5 1985 49.847 7.114485
6 1986 47.338 8.220143
7 1987 48.106 9.413085
8 1988 53.487 9.237561
9 1989 47.376 9.173845
10 1990 46.141 9.061634
11 1991 46.851 7.647189
12 1992 49.389 7.743318
13 1993 52.207 9.333309
14 1994 47.271 8.177815
15 1995 52.555 8.377355
16 1996 47.606 8.668769
17 1997 52.584 8.200558
18 1998 51.993 8.695232
19 1999 49.054 8.178929
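Neither snippet above applies the random sign that the question describes. The following is a hedged sketch of how that step could be folded into the resampling, assuming each sampled count gets its sign independently with equal probability. The helper `signed_resample_stats`, the seed, and the synthetic data are all illustrative and not from the original posts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed, only so the sketch is repeatable

# Synthetic stand-in for the real data: 100 rows for each year 1983-2009.
years = np.repeat(np.arange(1983, 2010), 100)
df = pd.DataFrame({"year": years, "count": rng.integers(1, 100, size=years.size)})

def signed_resample_stats(counts, n_rep=100, n_draw=10):
    """Mean and SD of n_rep means, each taken over n_draw randomly signed values."""
    values = counts.to_numpy()
    means = np.empty(n_rep)
    for i in range(n_rep):
        draw = rng.choice(values, size=n_draw, replace=False)  # sample without replacement
        signs = rng.choice([-1, 1], size=n_draw)               # independent random sign per value
        means[i] = (draw * signs).mean()
    return {"mean_of_100_means": means.mean(), "total_sd": means.std()}

# One output row per year.
rows = [{"year": year, **signed_resample_stats(group)}
        for year, group in df.groupby("year")["count"]]
result = pd.DataFrame(rows)
print(result.head())
```

With symmetric random signs the per-year means should hover around zero, so the interesting quantity is mostly `total_sd`.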
The data in question is imported from Excel:
ene2 = pd.read_excel('Energy Indicators.xls', index=False)
I recently asked a related question ("Changing Values of elements in Pandas Datastructure"), where the answers were clear and straightforward and solved my problem.
However, I have since gone a few steps further and hit a similar problem, where assigning a value does not change anything.
Let's consider the data structure:
print(ene2.head())
Country Energy Supply Energy Supply per Capita % Renewable's
15 NaN Gigajoules Gigajoules %
16 Afghanistan 321000000 10 78.6693
17 Albania 102000000 35 100
18 Algeria1 1959000000 51 0.55101
19 American Samoa ... ... 0.641026
238 Viet Nam 2554000000 28 45.3215
239 Wallis and Futuna Islands 0 26 0
240 Yemen 344000000 13 0
241 Zambia 400000000 26 99.7147
242 Zimbabwe 480000000 32 52.5361
243 NaN NaN NaN NaN
244 NaN NaN NaN NaN
where some countries carry a trailing number (like Algeria1 or Australia12).
I want to change those names to just Algeria, Australia and so on.
In total there are 20 entries that need to be changed.
I developed a method to do it, which fails at the last step:
for value in ene2['Country']:
    if type(value) == float: # to cover NaN values
        continue
    x = re.findall("\D+\d", value) # to find those countries/elements which are with number
    while len(x) > 0: # this shows elements with number, otherwise answer is [], which is 0
        for letters in x: # to touch letters
            right = letters[:-1] # and get rid of the last number
            ene2.loc[ene2['Country'] == value, 'Country'] = right # THIS IS ELEMENT WHICH FAILS <= it does not change the value
        x = re.findall("\D+\d", value) # to bring the new value to the while loop
The code above should do the job and remove all the trailing numbers from the names.
However, the ene2.loc[...] assignment, which worked previously, does nothing here where it is nested.
Why does this assignment not work, and how can I overcome the problem a) in an old-style way and b) in the pandas way?
The code suggests you already use pandas, so why not use the built-in replace method with a regex?
df = pd.DataFrame(data=["Afghanistan","Albania", "Algeria1", "Algeria9999"], columns=["Country"])
df["Country_clean"] = df["Country"].str.replace(r'\d+$', '', regex=True)
output:
print(df["Country_clean"])
0 Afghanistan
1 Albania
2 Algeria
3 Algeria
Name: Country_clean, dtype: object
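A self-contained sketch covering both the pandas way and the plain-Python way asked about in the question. It passes `regex=True` explicitly because, since pandas 2.0, `str.replace` treats the pattern as a literal string by default. The sample names and the NaN entry are illustrative stand-ins for the real column.

```python
import re

import numpy as np
import pandas as pd

countries = pd.Series(["Afghanistan", "Albania", "Algeria1", "Australia12", np.nan])

# pandas way: strip any run of digits at the end of the name; NaN stays NaN.
cleaned = countries.str.replace(r"\d+$", "", regex=True)

# Plain-Python way: apply re.sub to strings only, leaving NaN untouched.
cleaned_loop = [re.sub(r"\d+$", "", c) if isinstance(c, str) else c
                for c in countries]

print(cleaned.tolist())
```

Anchoring the pattern with `$` removes only trailing digits, so a digit inside a name would be left alone.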
I have a dataframe named growth with 4 columns.
State Name Average Fare ($)_x Average Fare ($)_y Average Fare ($)
0 AK 599.372368 577.790640 585.944324
1 AL 548.825867 545.144447 555.939466
2 AR 496.033146 511.867026 513.761296
3 AZ 324.641818 396.895324 389.545267
4 CA 368.937971 376.723839 366.918761
5 CO 502.611572 537.206439 531.191893
6 CT 394.105453 388.772428 370.904182
7 DC 390.872738 382.326510 392.394165
8 FL 324.941100 329.728524 337.249248
9 GA 485.335737 480.606365 489.574241
10 HI 326.084793 335.547369 298.709998
11 IA 428.151682 445.625840 462.614195
12 ID 482.092567 475.822275 491.714945
13 IL 329.449503 349.938794 346.022226
14 IN 391.627917 418.945137 412.242053
15 KS 452.312058 490.024059 420.182836
The last three columns are the average fare for each state in each year.
The 2nd, 3rd and 4th columns are for 2017, 2018 and 2019 respectively.
I want to find out which state has had the highest growth in fare since 2017.
I tried the code below, and it gives some output that I can't really understand.
I just need to find the state that has the highest fare growth since 2017.
My code:
growth[['Average Fare ($)_x','Average Fare ($)_y','Average Fare ($)']].pct_change()
You can use this:
df.set_index('State_name').pct_change(periods = 1, axis='columns').idxmax()
Change the periods value to 2 if you want to calculate the difference between first year & the 3rd year.
output
Average_fare_x NaN
Average_fare_y AZ #state with max change between 1st & 2nd year
Average_fare WV #state with max change between 2nd & 3rd year
growth[['Average Fare ($)_x','Average Fare ($)_y','Average Fare ($)']].pct_change(axis='columns')
This should give you the percentage change between each year.
growth['variation_percentage'] = growth[['Average Fare ($)_x','Average Fare ($)_y','Average Fare ($)']].pct_change(axis='columns').sum(axis=1)
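This sums the year-over-year percentage changes, which approximates (but does not exactly equal) the cumulative percentage change, because the changes compound.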
Since you are talking about price variation, the total growth or decrease in fare is the variation from 2017 to your last available data (2019). You can therefore compute this ratio and then take the maximum (here by sorting in descending order) to find the row with the most growth.
growth['variation_fare'] = growth['Average Fare ($)'] / growth['Average Fare ($)_x']
growth = growth.sort_values(['variation_fare'],ascending=False)
print(growth.head(1))
Example:
import pandas as pd
a = {'State':['AK','AL','AR','AZ','CA'],'2017':[100,200,300,400,500],'2018':[120,242,324,457,592],'2019':[220,393,484,593,582]}
growth = pd.DataFrame(a)
growth['2018-2017 variation'] = (growth['2018'] / growth['2017']) - 1
growth['2019-2018 variation'] = (growth['2019'] / growth['2018']) - 1
growth['total variation'] = (growth['2019'] / growth['2017']) - 1
growth = growth.sort_values(['total variation'],ascending=False)
print(growth.head(5)) #Showing top 5
Output:
State 2017 2018 2019 2018-2017 variation 2019-2018 variation total variation
0 AK 100 120 220 0.2000 0.833333 1.200000
1 AL 200 242 393 0.2100 0.623967 0.965000
2 AR 300 324 484 0.0800 0.493827 0.613333
3 AZ 400 457 593 0.1425 0.297593 0.482500
4 CA 500 592 582 0.1840 -0.016892 0.164000
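If only the name of the top state is needed, `idxmax` can return it directly once the state is the index. A minimal sketch on the first five rows from the question, assuming the `_x` column is 2017 and the unsuffixed column is 2019 as described there; `total_growth` and `top_state` are illustrative names.

```python
import pandas as pd

growth = pd.DataFrame({
    "State Name": ["AK", "AL", "AR", "AZ", "CA"],
    "Average Fare ($)_x": [599.372368, 548.825867, 496.033146, 324.641818, 368.937971],  # 2017
    "Average Fare ($)_y": [577.790640, 545.144447, 511.867026, 396.895324, 376.723839],  # 2018
    "Average Fare ($)": [585.944324, 555.939466, 513.761296, 389.545267, 366.918761],    # 2019
}).set_index("State Name")

# Relative change from the first year (2017) to the last year (2019).
total_growth = growth["Average Fare ($)"] / growth["Average Fare ($)_x"] - 1

# idxmax returns the index label (the state) of the largest value.
top_state = total_growth.idxmax()
print(top_state, round(total_growth[top_state], 4))
```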
I just started learning pandas a week ago or so and I've been struggling with a pandas dataframe for a bit now. My data looks like this:
State NY CA Other Total
Year
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
I made this table from a dataset that included 30 or so values for the variable I'm representing as State here. If they weren't NY or CA, in the example, I summed them and put them in an 'Other' category. The years here were derived from a normalized list of dates (originally mm/dd/yyyy and yyyy-mm-dd) as follows, in case this contributes to my issue:
dict = {'Date': pd.to_datetime(my_df.Date).dt.year}
and later:
my_df = my_df.rename_axis('Year')
I'm trying now to append a row at the bottom that shows the totals in each category:
final_df = my_df.append({'Year' : 'Total',
'NY': my_df.NY.sum(),
'CA': my_df.CA.sum(),
'Other': my_df.Other.sum(),
'Total': my_df.Total.sum()},
ignore_index=True)
This does technically work, but it makes my table look like this:
NY CA Other Total State
0 450 50 25 525 NaN
1 300 75 5 380 NaN
2 500 100 100 700 NaN
3 250 50 100 400 NaN
4 a b c d Total
('a' and so forth are the actual totals of the columns.) It adds a column at the beginning and puts my 'Year' column at the end. In fact, it removes the 'Date' label as well, and turns all the years in the last column into NaNs.
Is there any way I can get this formatted properly? Thank you for your time.
I believe you need to create a Series with sum and rename it:
final_df = my_df.append(my_df.sum().rename('Total'))
print (final_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
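Note that `DataFrame.append` was removed in pandas 2.0. A hedged sketch of the equivalent with `pd.concat`, rebuilding the example table from the question with `Year` as the index name; `total_row` is an illustrative name.

```python
import pandas as pd

my_df = pd.DataFrame(
    {"NY": [450, 300, 500, 250],
     "CA": [50, 75, 100, 50],
     "Other": [25, 5, 100, 100],
     "Total": [525, 380, 700, 400]},
    index=pd.Index([2003, 2004, 2005, 2006], name="Year"),
)

# Column sums as a one-row DataFrame whose index label is 'Total'.
total_row = my_df.sum().rename("Total").to_frame().T

# Stack the totals under the original rows and restore the index name.
final_df = pd.concat([my_df, total_row]).rename_axis("Year")
print(final_df)
```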
Another solution is to use loc for setting with enlargement:
my_df.loc['Total'] = my_df.sum()
print (my_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
Another idea, building on a previous answer, is to add the parameters margins=True and margins_name='Total' to crosstab:
df1 = df.assign(**dct)
out = (pd.crosstab(df1['Firing'], df1['State'], margins=True, margins_name='Total'))
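Since `df`, `dct` and the `Firing` column come from an earlier question that is not shown here, the following sketch uses hypothetical long-format records (one row per event, with illustrative `Year` and `State` columns) to show what the margins parameters add.

```python
import pandas as pd

# Hypothetical long-format records: one row per event.
records = pd.DataFrame({
    "Year": [2003, 2003, 2003, 2004, 2004, 2005],
    "State": ["NY", "NY", "CA", "NY", "Other", "CA"],
})

# margins=True appends a totals row and a totals column, both labelled 'Total'.
out = pd.crosstab(records["Year"], records["State"],
                  margins=True, margins_name="Total")
print(out)
```

This builds the totals while the table is being created, so no row has to be appended afterwards.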