     Type  Location  2019_perc  2020_perc  2021_perc  2022_perc
0  County  Crawford       1.55       1.85       1.10       1.10
1  County  Deck           0.80       1.76       3.00       2.50
2    City  Peoria         1.62       1.64       0.94       2.20
I have some data that's in a DataFrame with the above format. I'm accessing it using sqlite3 and using matplotlib to graph the data. I am trying to compare employee raises with the yearly CPI (one section of the bar chart with 2019 percentages for each location and the CPI that year, another for 2020, then 2021 and 2022). To do so I'd like to create bins by year, so the table would look more like this:
   Year  Crawford  Deck  Peoria
0  2019      1.55  0.80    1.62
1  2020      1.85  1.76    1.64
2  2021      1.10  3.00    0.94
3  2022      1.10  2.50    2.20
Is there any easy way to do this using pandas queries/sqlite3?
Assuming df is your DataFrame, here is one way to do it:
out = (
    df
    .drop("Type", axis=1)
    .set_index("Location")
    .pipe(lambda df_: df_.set_axis(df_.columns.str[:4], axis=1))
    .transpose()
    .reset_index(names="Year")
    .rename_axis(None, axis=1)
)
Output:
print(out)
   Year  Crawford  Deck  Peoria
0  2019      1.55  0.80    1.62
1  2020      1.85  1.76    1.64
2  2021      1.10  3.00    0.94
3  2022      1.10  2.50    2.20
Plot (with pandas.DataFrame.plot.bar):
out.set_index("Year").plot.bar();
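Since the question mentions sqlite3, for completeness: the same reshape works on a frame pulled through sqlite3. This is a sketch under assumptions; the in-memory database and the table name raises are made up for illustration, and the reshape itself still happens in pandas, as above.

```python
import sqlite3

import pandas as pd

# Hypothetical setup: load the question's data into an in-memory sqlite3
# table named "raises" so the round trip can be demonstrated end to end.
con = sqlite3.connect(":memory:")
pd.DataFrame({
    "Type": ["County", "County", "City"],
    "Location": ["Crawford", "Deck", "Peoria"],
    "2019_perc": [1.55, 0.8, 1.62],
    "2020_perc": [1.85, 1.76, 1.64],
    "2021_perc": [1.1, 3.0, 0.94],
    "2022_perc": [1.1, 2.5, 2.2],
}).to_sql("raises", con, index=False)
df = pd.read_sql("SELECT * FROM raises", con)

# The reshape happens in pandas: drop Type, index by Location, transpose,
# then trim "_perc" off the year labels.
out = df.drop(columns="Type").set_index("Location").T
out.index = out.index.str[:4]
out = out.rename_axis(None, axis=1).reset_index(names="Year")
print(out)
```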
Consider melt + pivot:
Data
from io import StringIO
import pandas as pd
txt = '''\
Type Location 2019_perc 2020_perc 2021_perc 2022_perc
0 County Crawford 1.55 1.85 1.1 1.1
1 County Deck 0.8 1.76 3 2.5
2 City Peoria 1.62 1.64 0.94 2.2
'''
with StringIO(txt) as f:
    cpi_raw_df = pd.read_csv(f, sep=r"\s+")
Reshape
cpi_df = (
    cpi_raw_df.melt(
        id_vars=["Type", "Location"],
        var_name="Year",
        value_name="perc"
    ).assign(
        Year=lambda df: df["Year"].str.replace("_perc", "", regex=False)
    ).pivot(
        index="Year",
        columns="Location",
        values="perc"
    )
)
print(cpi_df)
# Location  Crawford  Deck  Peoria
# Year
# 2019          1.55  0.80    1.62
# 2020          1.85  1.76    1.64
# 2021          1.10  3.00    0.94
# 2022          1.10  2.50    2.20
Plot
import matplotlib.pyplot as plt
import seaborn as sns
...
sns.set()
cpi_df.plot(kind="bar", rot=0)
plt.show()
plt.clf()
plt.close()
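The original goal was to plot each year's raises next to that year's CPI. One hedged way to extend the reshaped frame is to add a CPI column before plotting; the CPI numbers below are placeholders, so substitute the real yearly figures.

```python
import pandas as pd

# Reshaped frame from above, rebuilt here so this sketch runs on its own.
cpi_df = pd.DataFrame(
    {"Crawford": [1.55, 1.85, 1.10, 1.10],
     "Deck": [0.80, 1.76, 3.00, 2.50],
     "Peoria": [1.62, 1.64, 0.94, 2.20]},
    index=pd.Index(["2019", "2020", "2021", "2022"], name="Year"),
)

# Placeholder CPI values -- replace with the actual yearly CPI numbers.
cpi_df["CPI"] = pd.Series({"2019": 1.8, "2020": 1.2, "2021": 4.7, "2022": 8.0})

# cpi_df.plot(kind="bar", rot=0) now draws a CPI bar in each year's group.
print(cpi_df)
```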
I have come across some strange behavior in Pandas groupby-apply that I am trying to figure out.
Take the following example dataframe:
import pandas as pd
import numpy as np
index = range(1, 11)
groups = ["A", "B"]
idx = pd.MultiIndex.from_product([index, groups], names = ["index", "group"])
np.random.seed(12)
df = pd.DataFrame({"val": np.random.normal(size=len(idx))}, index=idx).reset_index()
print(df.tail().round(2))
index group val
15 8 B -0.12
16 9 A 1.01
17 9 B -0.91
18 10 A -1.03
19 10 B 1.21
And using this framework (which allows me to execute any arbitrary function within a groupby-apply):
def add_two(x):
    return x + 2

def pd_groupby_apply(df, out_name, in_name, group_name, index_name, function):
    def apply_func(df):
        if index_name is not None:
            df = df.set_index(index_name).sort_index()
        df[out_name] = function(df[in_name].values)
        return df[out_name]
    return df.groupby(group_name).apply(apply_func)
Whenever I call pd_groupby_apply with the following inputs, I get a pivoted DataFrame:
df_out1 = pd_groupby_apply(df=df,
                           out_name="test",
                           in_name="val",
                           group_name="group",
                           index_name="index",
                           function=add_two)
print(df_out1.head().round(2))
index 1 2 3 4 5 6 7 8 9 10
group
A 2.47 2.24 2.75 2.01 1.19 1.40 3.10 3.34 3.01 0.97
B 1.32 0.30 0.47 1.88 4.87 2.47 0.78 1.88 1.09 3.21
However, as soon as my dataframe does not contain full group-index pairs and I call my pd_groupby_apply function again, I receive my dataframe back in the way that I want (i.e. not pivoted):
df_notfull = df.iloc[:-1]
df_out2 = pd_groupby_apply(df=df_notfull,
                           out_name="test",
                           in_name="val",
                           group_name="group",
                           index_name="index",
                           function=add_two)
print(df_out2.head().round(2))
group index
A 1 2.47
2 2.24
3 2.75
4 2.01
5 1.19
Why is this? And more importantly, how can I prevent Pandas from pivoting my dataframe when I have full index-group pairs in my dataframe?
I have a df like the one below, and I want to create a dayshigh column. This column will show, for each row, the count of rows back until a higher high.
date high
05-06-20 1.85
08-06-20 1.88
09-06-20 2
10-06-20 2.11
11-06-20 2.21
12-06-20 2.17
15-06-20 1.99
16-06-20 2.15
17-06-20 16
18-06-20 9
19-06-20 14.67
It should look like this:
date high dayshigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16 8
18-06-20 9 0
19-06-20 14.67 1
I'm using the code below, but it's showing an error somehow:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
    for j in range(df["DaysHigh"][i].index, len(df)):
        if df["high"][i] > df["high"][i-1]:
            df["DaysHigh"][i] = df["DaysHigh"][i-1] + 1
        else:
            df["DaysHigh"][i] = 0
Where am I going wrong? Thank you.
Is the dayshigh number for 17-06-20 supposed to be 2 instead of 8? If so, you can basically use the code you had already written here. There are three changes I'm making below:
starting i from 1 instead of 0 to avoid trying to access the -1th element
removing the loop over j (doesn't seem to be necessary)
using loc to set the values instead of df["high"][i] -- you'll see this should resolve the warnings about copies and slices.
Keeping the first line the same as before:
for i in range(1, len(df)):
    if df["high"][i] > df["high"][i-1]:
        df.loc[i, "DaysHigh"] = df["DaysHigh"][i-1] + 1
    else:
        df.loc[i, "DaysHigh"] = 0
Procedure:
Use pandas.shift() to compare each row with the previous one and store the result in a temporary column.
Calculate the cumulative sum of that column, restarting at each NaN.
Delete the temporary column once it is no longer needed.
df['tmp'] = np.where(df['high'] >= df['high'].shift(), 1, np.nan)
df['dayshigh'] = df['tmp'].groupby(df['tmp'].isna().cumsum()).cumsum()
df.drop('tmp', axis=1, inplace=True)
df
date high dayshigh
0 05-06-20 1.85 NaN
1 08-06-20 1.88 1.0
2 09-06-20 2.00 2.0
3 10-06-20 2.11 3.0
4 11-06-20 2.21 4.0
5 12-06-20 2.17 NaN
6 15-06-20 1.99 NaN
7 16-06-20 2.15 1.0
8 17-06-20 16.00 2.0
9 18-06-20 9.00 NaN
10 19-06-20 14.67 1.0
Well, I think I figured it out; here is my solution:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
# for i in range(len(df)-1000, len(df)):
    for j in reversed(range(i)):
        if df["high"][i] > df["high"][j]:
            df.loc[i, "DaysHigh"] += 1
        else:
            break
print(df)
date high DaysHigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2.00 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16.00 8
18-06-20 9.00 0
19-06-20 14.67 1
I have the following code that generates four columns as I intended
df['revenue'] = pd.to_numeric(df['revenue'])  # converts the column to a numeric dtype
df['Date'] = pd.to_datetime(df['Date'], unit='s')
df['Year'] = df['Date'].dt.year
df['First Purchase Date'] = pd.to_datetime(df['First Purchase Date'], unit='s')
df['number_existing_customers'] = df.groupby(df['Year'])[['Existing Customer']].sum()
df['number_new_customers'] = df.groupby(df['Year'])[['New Customer']].sum()
df['Rate'] = df['number_new_customers']/df['number_existing_customers']
Table = df.groupby(df['Year'])[['New Customer', 'Existing Customer', 'Rate', 'revenue']].sum()
print(Table)
I want to be able to divide one column by another (new customers by existing) but I seem to be getting zeros when creating the new column (see output below).
>>> print(Table)
New Customer Existing Customer Rate revenue
Year
2014 7.00 2.00 0.00 11,869.47
2015 1.00 3.00 0.00 9,853.93
2016 5.00 3.00 0.00 4,058.53
2017 9.00 3.00 0.00 8,056.37
2018 12.00 7.00 0.00 22,031.23
2019 16.00 10.00 0.00 97,142.42
All you need to do is define the column and then use the corresponding operator, in this case /:
Table['Rate'] = Table['New customer']/Table['Existing customer']
In this example I'm copying your Table output and using the code I've posted:
import pandas as pd
import numpy as np
data = {'Year':[2014,2015,2016,2017,2018,2019],'New customer':[7,1,5,9,12,16],'Existing customer':[2,3,3,3,7,10],'revenue':[1000,1000,1000,1001,1100,1200]}
Table = pd.DataFrame(data).set_index('Year')
Table['Rate'] = Table['New customer']/Table['Existing customer']
print(Table)
Output:
New customer Existing customer revenue Rate
Year
2014 7 2 1000 3.500000
2015 1 3 1000 0.333333
2016 5 3 1000 1.666667
2017 9 3 1001 3.000000
2018 12 7 1100 1.714286
2019 16 10 1200 1.600000
I have the following csv file:
RUN YR AP15 PMTE
12008 4.53 0.04
12009 3.17 0.26
12010 6.20 1.38
12011 5.38 3.55
12012 7.32 6.13
12013 4.39 9.40
Here, the column 'YR' has the values 2008, 2009...2013. However, there is no space between the values for YR and values for RUN. Because of this, when I try to read in the dataframe, it does not read the YR column correctly.
pandas.read_csv('file.csv', skipinitialspace=True, usecols=['YR','PMTE'], sep=' ')
The line above reads in the AP15 column instead of YR. How do I fix this?
It seems like your 'csv' is really a fixed-width format file. Sometimes these are accompanied by another file listing the size of each column, but maybe you aren't that lucky and have to count the column widths manually. You can read this file with pandas' fixed-width reading function, read_fwf:
df = pd.read_fwf('fixed_width.txt', widths=[4, 4, 8, 8])
In [7]: df
Out[7]:
RUN YR AP15 PMTE
0 1 2008 4.53 0.04
1 1 2009 3.17 0.26
2 1 2010 6.20 1.38
3 1 2011 5.38 3.55
4 1 2012 7.32 6.13
5 1 2013 4.39 9.40
In [8]: df.columns
Out[8]: Index(['RUN', 'YR', 'AP15', 'PMTE'], dtype='object')
There is an option to find the widths automatically but it probably requires at least a space between each column, as it doesn't seem to work here.
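Besides widths, read_fwf also accepts colspecs, a list of explicit (start, end) half-open character ranges, which can be easier than counting widths. A small self-contained sketch on synthetic data shaped like the file above (the exact column boundaries are assumptions for the demo):

```python
from io import StringIO

import pandas as pd

# Synthetic fixed-width sample: RUN is 1 character, YR the next 4,
# then two space-padded float columns.
raw = StringIO(
    "RUNYR   AP15 PMTE\n"
    "12008   4.53 0.04\n"
    "12009   3.17 0.26\n"
)

# colspecs gives explicit (start, end) character ranges per column;
# skiprows=1 skips the original header and names supplies clean labels.
df = pd.read_fwf(raw,
                 colspecs=[(0, 1), (1, 5), (8, 12), (13, 17)],
                 names=["RUN", "YR", "AP15", "PMTE"],
                 skiprows=1)
print(df)
```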
One workaround would be to first combine the RUN and YR columns into one in your csv. Example -
RUNYR AP15 PMTE
12008 4.53 0.04
12009 3.17 0.26
12010 6.20 1.38
12011 5.38 3.55
12012 7.32 6.13
12013 4.39 9.40
Then read the csv into a dataframe with RUNYR as a string column, and slice the RUNYR column into two separate columns using the pandas.Series.str.slice method. Example -
df = pd.read_csv('file.csv', skipinitialspace=True, header=0, sep=' ', dtype={'RUNYR': str})
df['RUN'] = df['RUNYR'].str.slice(None, 1).astype(int)
df['YR'] = df['RUNYR'].str.slice(1).astype(int)
df = df.drop('RUNYR', axis=1)
Demo -
In [21]: df = pd.read_csv('a.csv', skipinitialspace=True, header=0, sep=' ',dtype={'RUNYR':str})
In [22]: df['RUN'] = df['RUNYR'].str.slice(None,1).astype(int)
In [23]: df['YR'] = df['RUNYR'].str.slice(1).astype(int)
In [24]: df = df.drop('RUNYR',axis=1)
In [25]: df
Out[25]:
AP15 PMTE RUN YR
0 4.53 0.04 1 2008
1 3.17 0.26 1 2009
2 6.20 1.38 1 2010
3 5.38 3.55 1 2011
4 7.32 6.13 1 2012
5 4.39 9.40 1 2013
And then write this back to your csv using the .to_csv method (to fix your csv permanently).
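That final write-back might look like the sketch below; the filename is hypothetical, and with no path argument to_csv returns the CSV text instead of writing a file, which is handy for checking the result first.

```python
import pandas as pd

# A repaired frame like the demo above produces.
df = pd.DataFrame({"AP15": [4.53, 3.17], "PMTE": [0.04, 0.26],
                   "RUN": [1, 1], "YR": [2008, 2009]})

# df.to_csv("file.csv", index=False)  # hypothetical filename; writes a file
csv_text = df.to_csv(index=False)    # index=False drops the row numbers
print(csv_text)
```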
I have the following table called totalData; printing totalData displays the following:
Region Q1 Q2 Q3 Q4
0 West 1 5.2 3.1 2.05
1 Center 3.1 1.2 1.2 3
2 East 1.9 4.1 1.1 5.3
I'd like to use a bar chart to compare changes across quarters per region, with a section of 4 bars per region.
I'd like to use only the numerical data, and display the region as my X axis and the Quarter as my Y axis.
I've tried to write :
totalData.hist(kind='bar')
but it ignores the Region and the Quarter and gives me the numerical column as my X axis (how do I get rid of this column?) and integer values up to 6 (just above my highest value in the table).
How could I use Region and Quarter as my axis values?
This is really simple. You have two options:
Set Region as the index of the dataframe
Pass x='Region' to the plot method.
Method 1:
from io import StringIO
import matplotlib.pyplot as plt
import pandas
data = StringIO("""\
Region Q1 Q2 Q3 Q4
West 1 5.2 3.1 2.05
Center 3.1 1.2 1.2 3
East 1.9 4.1 1.1 5.3
""")
df = pandas.read_table(data, sep=r'\s+')
df = df.set_index('Region')
df.plot(kind='bar')
Method 2:
from io import StringIO
import matplotlib.pyplot as plt
import pandas
data = StringIO("""\
Region Q1 Q2 Q3 Q4
West 1 5.2 3.1 2.05
Center 3.1 1.2 1.2 3
East 1.9 4.1 1.1 5.3
""")
df = pandas.read_table(data, sep=r'\s+')
df.plot(kind='bar', x='Region')
Both give me the same grouped bar chart.