Use the header as axis and first column as axis - python

I have the following table called totalData; printing totalData displays the following:
Region Q1 Q2 Q3 Q4
0 West 1 5.2 3.1 2.05
1 Center 3.1 1.2 1.2 3
2 East 1.9 4.1 1.1 5.3
I'd like to use a bar chart to compare the changes across quarters for each region, with a section of 4 bars per region.
I'd like to plot only the numerical data, with the Region on my X axis and the Quarter values on my Y axis.
I've tried to write :
totalData.hist(kind='bar')
but it ignores the Region and the Quarter, gives me the numerical index column as my X axis (how do I get rid of this column?), and shows integer values only up to 6 (less than the highest value in my table).
How can I use Region and Quarter as my axis values?

This is really simple. You have two options:
1. Set Region as the index of the dataframe.
2. Pass x='Region' to the plot method.
Method 1:
from io import StringIO
import matplotlib.pyplot as plt
import pandas
data = StringIO("""\
Region Q1 Q2 Q3 Q4
West 1 5.2 3.1 2.05
Center 3.1 1.2 1.2 3
East 1.9 4.1 1.1 5.3
""")
df = pandas.read_table(data, sep=r'\s+')  # raw string avoids an invalid-escape warning
df = df.set_index('Region')
df.plot(kind='bar')
Method 2:
from io import StringIO
import matplotlib.pyplot as plt
import pandas
data = StringIO("""\
Region Q1 Q2 Q3 Q4
West 1 5.2 3.1 2.05
Center 3.1 1.2 1.2 3
East 1.9 4.1 1.1 5.3
""")
df = pandas.read_table(data, sep=r'\s+')  # raw string avoids an invalid-escape warning
df.plot(kind='bar', x='Region')
Both give me a bar chart with the regions along the X axis and one group of four bars (Q1 to Q4) per region.
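To see why both methods work, it can help to look at the frame that the plot call receives. The following is only a minimal sketch, assuming the same sample data as above and using pandas.read_csv in place of read_table; the variable names indexed and by_quarter are hypothetical. A bar plot takes its X tick labels from the index and draws one bar series per remaining column, so once Region is the index only the numeric quarters are left to plot.

```python
from io import StringIO

import pandas as pd

data = StringIO("""\
Region Q1 Q2 Q3 Q4
West 1 5.2 3.1 2.05
Center 3.1 1.2 1.2 3
East 1.9 4.1 1.1 5.3
""")
df = pd.read_csv(data, sep=r"\s+")

# With Region as the index, only the numeric quarter columns remain.
indexed = df.set_index("Region")
print(indexed.index.tolist())    # labels that end up on the X axis
print(indexed.columns.tolist())  # one bar series per quarter

# Transposing swaps the roles: quarters on the X axis, one series per region.
by_quarter = indexed.T
print(by_quarter)
```

Calling indexed.plot(kind='bar') groups the bars by region, while by_quarter.plot(kind='bar') would group them by quarter instead.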


Sqlite3 Issue, Column title not accepted

   Type    Location  2019_perc  2020_perc  2021_perc  2022_perc
0  County  Crawford  1.55       1.85       1.1        1.1
1  County  Deck      0.8        1.76       3          2.5
2  City    Peoria    1.62       1.64       0.94       2.2
I have some data in a DataFrame with the above format. I'm accessing it using sqlite3 and using matplotlib to graph the data. I am trying to compare employee raises with the yearly CPI (one section of the bar chart with the 2019 percentages for each location and the CPI that year, another for 2020, 2021, and 2022). To do so I'd like to create bins by year, so the table would look more like this:
   Year  Crawford  Deck  Peoria
0  2019  1.55      0.8   1.62
1  2020  1.85      1.76  1.64
2  2021  1.1       3     0.94
3  2022  1.1       2.5   2.2
Is there any easy way to do this using pandas queries/sqlite3?
Assuming (df) is your dataframe, here is one way to do it :
out = (
    df
    .drop("Type", axis=1)
    .set_index("Location")
    .pipe(lambda df_: df_.set_axis(df_.columns.str[:4], axis=1))
    .transpose()
    .reset_index(names="Year")
    .rename_axis(None, axis=1)
)
Output :
print(out)
Year Crawford Deck Peoria
0 2019 1.55 0.80 1.62
1 2020 1.85 1.76 1.64
2 2021 1.10 3.00 0.94
3 2022 1.10 2.50 2.20
Plot (with pandas.DataFrame.plot.bar):
out.set_index("Year").plot.bar();
Consider melt + pivot:
Data
from io import StringIO
import pandas as pd
txt = '''\
Type Location 2019_perc 2020_perc 2021_perc 2022_perc
0 County Crawford 1.55 1.85 1.1 1.1
1 County Deck 0.8 1.76 3 2.5
2 City Peoria 1.62 1.64 0.94
'''
with StringIO(txt) as f:
    cpi_raw_df = pd.read_csv(f, sep=r"\s+")
Reshape
cpi_df = (
    cpi_raw_df.melt(
        id_vars = ["Type", "Location"],
        var_name = "Year",
        value_name = "perc"
    ).assign(
        Year = lambda df: df["Year"].str.replace("_perc", "")
    ).pivot(
        index = "Year",
        columns = "Location",
        values = "perc"
    )
)
print(cpi_df)
# Location Crawford Deck Peoria
# Year
# 2019 1.55 0.80 1.62
# 2020 1.85 1.76 1.64
# 2021 1.10 3.00 0.94
# 2022 1.10 2.50 NaN
Plot
import matplotlib.pyplot as plt
import seaborn as sns
...
sns.set()
cpi_df.plot(kind="bar", rot=0)
plt.show()
plt.clf()
plt.close()
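Since the question reads the data through sqlite3, here is a hedged end-to-end sketch of that step followed by a reshape. It assumes an in-memory database and a hypothetical table name, raises; the real database and table will differ. Column names that start with a digit, such as 2019_perc, are wrapped in double quotes in the SQL text.

```python
import sqlite3

import pandas as pd

# Sample rows from the question, loaded into a throwaway in-memory database.
raw = pd.DataFrame({
    "Type": ["County", "County", "City"],
    "Location": ["Crawford", "Deck", "Peoria"],
    "2019_perc": [1.55, 0.8, 1.62],
    "2020_perc": [1.85, 1.76, 1.64],
    "2021_perc": [1.1, 3.0, 0.94],
    "2022_perc": [1.1, 2.5, 2.2],
})
con = sqlite3.connect(":memory:")
raw.to_sql("raises", con, index=False)  # "raises" is a hypothetical table name

# Identifiers that begin with a digit are double-quoted in the query.
query = ('SELECT Location, "2019_perc", "2020_perc", "2021_perc", "2022_perc" '
         'FROM raises')
df = pd.read_sql_query(query, con)
con.close()

# One row per year, one column per location.
by_year = df.set_index("Location").T
by_year.index = by_year.index.str[:4]   # "2019_perc" -> "2019"
by_year = (by_year.rename_axis("Year")
                  .rename_axis(None, axis=1)
                  .reset_index())
print(by_year)
```

From here, by_year.set_index("Year").plot.bar() would draw one group of bars per year, as in the first answer.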

Reshaping a dataframe every nth column

I have two datasets. After merging them horizontally and sorting the columns with the code below, I get the dataset that follows it:
df=
   X    Y
   5.2  6.5
   3.3  7.6
df_year=
   X     Y
   2014  2014
   2015  2015
df_all_cols = pd.concat([df, df_year], axis = 1)
sorted_columns = sorted(df_all_cols.columns)
df_all_cols_sort = df_all_cols[sorted_columns]
   X    X     Y    Y
   5.2  2014  6.5  2014
   3.3  2015  7.6  2015
I am trying to make my data look like this, by stacking the dataset every 2 columns.
   name  year  Variable
   5.2   2014  X
   3.3   2015  X
   6.5   2014  Y
   7.6   2015  Y
One approach could be as follows:
Apply df.stack to both dfs before feeding them to pd.concat. The result at this stage being:
0 1
0 X 5.2 2014
Y 6.5 2014
1 X 3.3 2015
Y 7.6 2015
Next, use df.sort_index to sort on the original column names (i.e. "X, Y", now appearing as index level 1), and get rid of index level 0 (df.droplevel).
Finally, use df.reset_index with drop=False to insert index as a column and rename all the columns with df.rename.
res = (pd.concat([df.stack(), df_year.stack()], axis=1)
       .sort_index(level=1)
       .droplevel(0)
       .reset_index(drop=False)
       .rename(columns={'index':'Variable', 0:'name', 1:'year'})
      )
# change the order of cols
res = res.iloc[:, [1,2,0]]
print(res)
name year Variable
0 5.2 2014 X
1 3.3 2015 X
2 6.5 2014 Y
3 7.6 2015 Y
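As a possible alternative to the stack-based approach above, the same layout can be reached with DataFrame.melt. This is only a sketch, assuming both frames share the same columns and row order as in the question.

```python
import pandas as pd

df = pd.DataFrame({"X": [5.2, 3.3], "Y": [6.5, 7.6]})
df_year = pd.DataFrame({"X": [2014, 2015], "Y": [2014, 2015]})

# melt stacks the columns one after another: all X rows, then all Y rows.
res = df.melt(var_name="Variable", value_name="name")

# Melting df_year the same way yields the years in the same row order.
res["year"] = df_year.melt(value_name="year")["year"]

# Reorder the columns to match the requested layout.
res = res[["name", "year", "Variable"]]
print(res)
```

This relies on both melts producing rows in the same order, so it should only be used when the two frames are aligned column by column.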

Shifting Row values Upwards/Downwards and Replacing Empty Cells with Preceding or Succeeding Values in Pandas DataFrame

I have a data frame whose columns contain values for different countries. I would like a function that shifts the rows of each country column independently, leaving the dates untouched. For example, I have a list of profile shifters, one per country, which should be used to shift the rows.
If the profile shifter for a country is -3, that country's column is shifted 3 times downwards, while the last 3 values become the first 3 values in the dataframe. If a profile shifter is +3, the third value of a row is shifted upwards while the first 2 values become the last values in that column.
After the rows have been shifted, instead of having the default NaN value appear in the empty cells, I want the preceding or succeeding values to take up the empty cells. The function should also return a data frame. The sample dataset, profile shifters, and expected results are shown below.
Sample Dataset:
Datetime ARG AUS BRA
1/1/2050 0.00 0.1 2.1 3.1
1/1/2050 1.00 0.2 2.2 3.2
1/1/2050 2.00 0.3 2.3 3.3
1/1/2050 3.00 0.4 2.4 3.4
1/1/2050 4.00 0.5 2.5 3.5
1/1/2050 5.00 0.6 2.6 3.6
Country Profile Shifters:
Country ARG AUS BRA
UTC -3 -2 4
Desired Output:
Datetime ARG AUS BRA
1/1/2050 0.00 0.3 2.4 3.4
1/1/2050 1.00 0.4 2.5 3.5
1/1/2050 2.00 0.5 2.1 3.1
1/1/2050 3.00 0.1 2.2 3.2
1/1/2050 4.00 0.2 2.3 3.3
This is what I have been trying for days now but it's not working
cols = df1.columns
for i in cols:
    if i == 'ARG':
        x = df1.iat[0:3,0]
        df1['ARG'] = df1.ARG.shift(periods=-3)
        df1['ARG'].replace(to_replace=np.nan, x)
    elif i == 'AUS':
        df1['AUS'] = df1.AUS.shift(periods=2)
    elif i == 'BRA':
        df1['BRA'] = df1.BRA.shift(periods=1)
    else:
        pass
This works but is far from being 'good pandas'. I hope that someone will come along and give a nicer, cleaner 'more pandas' answer.
Imports used:
import pandas as pd
import datetime as datetime
Offset data setup:
offsets = pd.DataFrame({"Country" : ["ARG", "AUS", "BRA"], "UTC Offset" : [-3, -2, 4]})
Produces:
Country UTC Offset
0 ARG -3
1 AUS -2
2 BRA 4
Note that the timezone offset data I've used here is in a slightly different structure from the example data (country codes by rows, rather than columns). Also worth pointing out that Australia and Brazil have several time zones, so there is no one single UTC offset which applies to those whole countries (only one in Argentina though).
Sample data setup:
# DataFrame.append was removed in pandas 2.0, so collect the rows first
rows = []
for i in range(6):
    dt = datetime.datetime(2050, 1, 1, i)
    rows.append({'Datetime' : dt,
                 'ARG' : i / 10,
                 'AUS' : (i + 10) / 10,
                 'BRA' : (i + 20) / 10})
sampleDf = pd.DataFrame(rows)
Produces:
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.0 1.0 2.0
1 2050-01-01 01:00:00 0.1 1.1 2.1
2 2050-01-01 02:00:00 0.2 1.2 2.2
3 2050-01-01 03:00:00 0.3 1.3 2.3
4 2050-01-01 04:00:00 0.4 1.4 2.4
5 2050-01-01 05:00:00 0.5 1.5 2.5
Code to shift cells:
for idx, offsetData in offsets.iterrows(): # See note 1
    countryCode = offsetData["Country"]
    utcOffset = offsetData["UTC Offset"]
    dfRowCount = sampleDf.shape[0]
    wrappedOffset = (dfRowCount + utcOffset) if utcOffset < 0 else \
                    (-dfRowCount + utcOffset) # See note 2
    countryData = sampleDf[countryCode]
    sampleDf[countryCode] = pd.concat([countryData.shift(utcOffset).dropna(),
                                       countryData.shift(wrappedOffset).dropna()]).sort_index() # See note 3
Produces:
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.3 1.2 2.2
1 2050-01-01 01:00:00 0.4 1.3 2.3
2 2050-01-01 02:00:00 0.5 1.4 2.4
3 2050-01-01 03:00:00 0.0 1.5 2.5
4 2050-01-01 04:00:00 0.1 1.0 2.0
5 2050-01-01 05:00:00 0.2 1.1 2.1
Notes
1. Iterating over rows in pandas like this (to me) indicates 'you've run out of pandas skill, and are kind of going against the design of pandas'. What I have here works, but it won't benefit from any/many of the efficiencies of using pandas, and would not be appropriate for a large dataset. Using itertuples rather than iterrows is supposed to be quicker, but I think neither is great, so I went with what seemed most readable for this case.
2. This solution does two shifts, one of the data shifted by the timezone offset, then a second shift of everything else to fill in what would otherwise be NaN holes left by the first shift. This line calculates the size of that second shift.
3. Finally, the results of the two shifts are concatenated together (after dropping any NaN values from both of them) and assigned back to the original (unshifted) column. sort_index puts them back in order based on the index, rather than having the two shifted parts one-after-another.
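The two-shift-and-concatenate steps described in the notes amount to a circular shift, which numpy.roll performs in one call. The following is a sketch only, not the answer's own method: it assumes the same sample values as above and the same sign convention as Series.shift (a negative offset moves values towards the top). If the profile shifters use the opposite convention, negate the offset.

```python
import datetime

import numpy as np
import pandas as pd

# Same sample values as the answer's setup code.
sample = pd.DataFrame({
    "Datetime": [datetime.datetime(2050, 1, 1, i) for i in range(6)],
    "ARG": [i / 10 for i in range(6)],
    "AUS": [(i + 10) / 10 for i in range(6)],
    "BRA": [(i + 20) / 10 for i in range(6)],
})
offsets = {"ARG": -3, "AUS": -2, "BRA": 4}

shifted = sample.copy()
for country, offset in offsets.items():
    # np.roll wraps values around the ends, so no NaN holes appear.
    shifted[country] = np.roll(sample[country].to_numpy(), offset)

print(shifted)
```

Because the values wrap around, no dropna or sort_index step is needed and the Datetime column is left untouched.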

Comparing the dataframe contents and changing the column color if condition is not met

I have a data frame as shown below. I need to compare min with SPEC_MIN and max with SPEC_MAX.
If max > SPEC_MAX then the max cell should be colored red, and if min < SPEC_MIN then the min cell also needs to be red. May I know how to do this?
min max SPEC_MIN SPEC_MAX
V_PTAT3V3[V] 1.124 1.14 1.095 1.2
You may. Here is an example. Assuming your dataframe looks somewhat like this
min max spec_min spec_max
0 1.298092 0.857875 1.0 1.2
1 1.814168 1.032747 0.8 1.0
2 1.396925 1.092014 1.0 1.2
3 1.616848 1.279176 0.8 1.0
4 1.956991 1.200024 1.0 1.2
5 1.649614 1.203371 1.0 1.2
6 1.195811 0.432663 1.2 1.4
7 1.313263 0.795951 1.2 1.4
8 1.157487 1.235014 1.0 1.2
9 1.546830 1.094696 1.2 1.4
10 1.135896 0.792172 0.8 1.0
11 1.561299 0.763911 1.2 1.4
12 1.324006 0.956222 1.0 1.2
13 1.283233 0.585565 1.0 1.2
14 1.179644 0.983332 1.2 1.4
15 1.696883 1.199471 1.2 1.4
16 1.130002 0.947254 0.8 1.0
17 1.249352 0.865932 1.2 1.4
18 1.365273 0.721204 1.0 1.2
19 1.155129 0.722179 1.2 1.4
20 1.315393 0.590603 0.8 1.0
import numpy as np

def highlight_under_spec_min(s, props=''):
    # Style the cells whose value is below the row's spec_min
    return np.where(s < df['spec_min'], props, '')

def highlight_under_spec_max(s, props=''):
    # Despite its name, this styles the cells whose value exceeds spec_max
    return np.where(s > df['spec_max'], props, '')

df.style.apply(highlight_under_spec_min, props='color:white;background-color:red', subset=['min'], axis=0)\
  .apply(highlight_under_spec_max, props='color:white;background-color:red', subset=['max'], axis=0)
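gives you a styled table in which the out-of-spec min and max cells are shown as white text on a red background.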
If this is not what you want I suggest you give an example with cells you want and don't want colored.
You can probably try something like this.
import pandas as pd
df = pd.DataFrame([{'min': 1.124, 'max': 1.14, 'SPEC_MIN': 1.095, 'SPEC_MAX': 1.2}])
def style_cells(dataframe):
    """This function is to be used with the pandas Styler.apply method to
    color the cells based on different conditions."""
    # Preparing styles.
    conditional_style = 'background-color: red'
    default_style = ''
    # Comparing values and creating masks.
    mask = dataframe['min'] < dataframe['SPEC_MIN']
    # Creating a style DataFrame with same indices and columns as in the original DataFrame.
    df = pd.DataFrame(default_style, index=dataframe.index, columns=dataframe.columns)
    # Modifying cell colors.
    df.loc[mask, 'min'] = conditional_style
    # The same procedure for max values.
    mask = dataframe['max'] > dataframe['SPEC_MAX']
    df.loc[mask, 'max'] = conditional_style
    return df
df.style.apply(style_cells, axis=None)
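Before styling, it may be worth checking which cells actually violate the spec, because the sample row in the question is within limits and would not be colored at all. The sketch below uses plain boolean masks; the second row, labelled HYPOTHETICAL, is made up purely to show a failing case.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"min": 1.124, "max": 1.14, "SPEC_MIN": 1.095, "SPEC_MAX": 1.2},
        {"min": 1.0, "max": 1.25, "SPEC_MIN": 1.095, "SPEC_MAX": 1.2},
    ],
    index=["V_PTAT3V3[V]", "HYPOTHETICAL"],  # second row is invented
)

# True marks a cell that would be colored red by either styling answer.
out_of_spec = pd.DataFrame({
    "min": df["min"] < df["SPEC_MIN"],
    "max": df["max"] > df["SPEC_MAX"],
})
print(out_of_spec)
```

The same masks drive both styling functions above, so an all-False row here means no red cells in the rendered table.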

Rolling Linear Fit with Python DataFrame

I want to perform a moving window linear fit to the columns in my dataframe.
n =5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
B A
2000-01-01 1.9 3.2
2000-01-02 2.3 1.3
2000-01-03 4.4 5.6
2000-01-04 5.6 9.4
2000-01-05 7.3 10.4
For, say, column B, I want to perform a linear fit using the first two rows, then another linear fit using the second and third rows, and so on. And the same for column A. I am only interested in the slope of the fit, so at the end I want a new dataframe with the entries above replaced by the different rolling slopes.
After doing
df.reset_index()
I try something like
model = pd.ols(y=df['A'], x=df['index'], window_type='rolling',window=3)
But I get
KeyError: 'index'
EDIT:
I added a new column
df['i'] = range(0,len(df))
and I can now run
pd.ols(y=df['A'], x=df.i, window_type='rolling',window=3)
(it gives an error for window=2)
I do not understand this well, because I was expecting a series of numbers but I get just one result
-------------------------Summary of Regression Analysis---------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 3
Number of Degrees of Freedom: 2
R-squared: 0.8981
Adj R-squared: 0.7963
Rmse: 1.1431
F-stat (1, 1): 8.8163, p-value: 0.2068
Degrees of Freedom: model 1, resid 1
-----------------------Summary of Estimated Coefficients--------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 2.4000 0.8083 2.97 0.2068 0.8158 3.9842
intercept 1.2667 2.5131 0.50 0.7028 -3.6590 6.1923
---------------------------------End of Summary---------------------------------
EDIT 2:
Now I understand better what is going on. I can access the different values of the fits using
model.beta
I haven't tried it out, but I don't think you need to specify window_type='rolling'; if you set window to something, window_type will automatically be set to rolling.
Source.
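pd.ols was removed from later pandas releases, so the calls above no longer run on a current installation. A hedged substitute is sketched below: it uses DataFrame.rolling with numpy.polyfit, assumes the rows are evenly spaced (one day apart, as in the question) so that the slope is expressed per row step, and introduces a hypothetical helper named window_slope.

```python
import numpy as np
import pandas as pd

n = 5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9, 2.3, 4.4, 5.6, 7.3]
df['A'] = [3.2, 1.3, 5.6, 9.4, 10.4]

def window_slope(values):
    # Fit a straight line through one window and keep only the slope.
    x = np.arange(len(values))
    return np.polyfit(x, values, 1)[0]

# raw=True hands each window to the helper as a plain NumPy array.
slopes_2 = df.rolling(window=2).apply(window_slope, raw=True)
slopes_3 = df.rolling(window=3).apply(window_slope, raw=True)
print(slopes_2)
print(slopes_3)
```

Each result keeps the original DatetimeIndex; the first window - 1 rows are NaN because no full window is available there.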
I have problems doing this with the DatetimeIndex you created with pd.date_range, and find datetimes a confusing pain to work with in general due to the number of types out there and apparent incompatibility between APIs. Here's how I would do it if the date were an integer (e.g. days since 12/31/99, or years) or float in your example. It won't help your datetime problem, but hopefully it helps with the rolling linear fit part.
Generating your data with integers instead of datetimes:
df = pd.DataFrame()
df['date'] = range(1,6)
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
date B A
0 1 1.9 3.2
1 2 2.3 1.3
2 3 4.4 5.6
3 4 5.6 9.4
4 5 7.3 10.4
Since you want to group by 2 dates every time, then fit a linear model on each group, let's duplicate the records and number each group with the index:
df_dbl = pd.concat([df, df]).sort_index()  # DataFrame.sort() no longer exists; sort_index() orders by index
df_dbl = df_dbl.iloc[1:-1] # removes the first and last row
date B A
0 1 1.9 3.2 # this record is removed
0 1 1.9 3.2
1 2 2.3 1.3
1 2 2.3 1.3
2 3 4.4 5.6
2 3 4.4 5.6
3 4 5.6 9.4
3 4 5.6 9.4
4 5 7.3 10.4
4 5 7.3 10.4 # this record is removed
c = df_dbl.index[1:len(df_dbl.index)].tolist()
c.append(max(df_dbl.index))
df_dbl.index = c
date B A
1 1 1.9 3.2
1 2 2.3 1.3
2 2 2.3 1.3
2 3 4.4 5.6
3 3 4.4 5.6
3 4 5.6 9.4
4 4 5.6 9.4
4 5 7.3 10.4
Now it's ready to group by index to run linear models on B vs. date, which I learned from Using Pandas groupby to calculate many slopes. I use scipy.stats.linregress since I got weird results with pd.ols and couldn't find good documentation to understand why (perhaps because it's geared toward datetime).
1 0.4
2 2.1
3 1.2
4 1.7
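The grouping code itself is not shown above. The following sketch reproduces the listed slopes under stated assumptions: numpy.polyfit is swapped in for scipy.stats.linregress to keep the dependencies to pandas and NumPy, and df_dbl is rebuilt with sort_index so that it runs on current pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['date'] = range(1, 6)
df['B'] = [1.9, 2.3, 4.4, 5.6, 7.3]
df['A'] = [3.2, 1.3, 5.6, 9.4, 10.4]

# Duplicate every record, then drop the first and last copies.
df_dbl = pd.concat([df, df]).sort_index().iloc[1:-1]

# Renumber the index so that each consecutive pair of dates shares one label.
c = df_dbl.index[1:].tolist()
c.append(max(df_dbl.index))
df_dbl.index = c

# One straight-line fit of B against date per group; keep only the slope.
slopes = pd.Series({
    label: np.polyfit(group['date'], group['B'], 1)[0]
    for label, group in df_dbl.groupby(level=0)
})
print(slopes.round(6))
```

Replacing 'B' with 'A' in the fit gives the corresponding slopes for the other column.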
