I have a csv that contains warehouses, dates and quantities (stock). I'm trying to plot the quantities by date for each warehouse (a separate plot by warehouse). I'm a beginner in Python, I've tried looking around but I can't find anything that would solve my problem.
Here's what the table looks like (csv sample attached as an image):
Thanks for your help!
import pandas as pd

data = pd.read_csv("path_to_your_file.csv", header=0)
# Average, count, and total quantity per warehouse and date
by_date = data.groupby(["warehouse", "date"]).agg(['mean', 'count', 'sum'])
print(by_date)
Something like this is simple and would give you your result. You will first need to install the pandas library with pip in the console:
$> pip install pandas
The pandas documentation (https://pandas.pydata.org/docs/) has tutorials, walk-throughs, and a cheat sheet covering the basics.
If you want to plot the data rather than simply print it, you can do something like the following:
import pandas as pd
df = pd.read_csv("path_to_your_file.csv")
This should produce a DataFrame of the form:
Wharehouse Date Qty
0 A 4/20/2022 485
1 A 4/21/2022 642
2 A 4/22/2022 315
3 A 4/23/2022 845
4 B 4/20/2022 325
5 B 4/21/2022 156
6 B 4/22/2022 851
7 C 4/20/2022 268
8 C 4/21/2022 452
9 C 4/22/2022 265
To plot the data:
df.groupby('Wharehouse').plot.bar(x='Date', y='Qty')
Yields the following:
Wharehouse
A AxesSubplot(0.125,0.125;0.775x0.755)
B AxesSubplot(0.125,0.125;0.775x0.755)
C AxesSubplot(0.125,0.125;0.775x0.755)
dtype: object
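If you want one separate, labelled figure per warehouse instead of the bare AxesSubplot listing, a minimal sketch (assuming the Wharehouse / Date / Qty column names above) is to loop over the groups yourself:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("path_to_your_file.csv")
# One figure per warehouse, each with its own title and axis labels
for name, group in df.groupby('Wharehouse'):
    ax = group.plot.bar(x='Date', y='Qty', legend=False)
    ax.set_title(f"Warehouse {name}")
    ax.set_ylabel("Qty")
plt.show()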
Related
I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method in the documentation that does it without rewriting the whole DF. Does anyone know how to do it?
DataFrame:
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291
In [18]: a
Out[18]:
x1 x2
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
In [19]: a['x2'] = a.x2.shift(1)
In [20]: a
Out[20]:
x1 x2
0 0 NaN
1 1 5
2 2 6
3 3 7
4 4 8
You need to use df.shift here.
df.shift(i) shifts the entire dataframe down by i rows.
So, for i = 1:
Input:
x1 x2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Output:
x1 x2
0 NaN NaN
1 206 214
2 226 234
3 245 253
4 265 272
So, run this script to get the expected output:
import pandas as pd
df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})
print(df)
df['x2'] = df['x2'].shift(1)
print(df)
Let's define the DataFrame from your example by
>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]],
...                   columns=[1, 2])
>>> df
1 2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Then you could manipulate the index of the second column (pulling it out as its own Series first, since assigning to df[2].index directly may not persist in newer pandas):
>>> s2 = df[2].copy()
>>> s2.index = s2.index + 1
and finally re-combine the single columns
>>> pd.concat([df[1], s2], axis=1)
1 2
0 206.0 NaN
1 226.0 214.0
2 245.0 234.0
3 265.0 253.0
4 283.0 272.0
5 NaN 291.0
Perhaps not fast, but simple to read. Consider setting variables for the column names and the actual shift required, as sketched below.
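A sketch of that suggestion (col and shift_by are made-up names; this reuses the df defined above):
col = 2        # the column to shift
shift_by = 1   # how many rows to shift it down
s = df[col].copy()
s.index = s.index + shift_by
df_shifted = pd.concat([df.drop(columns=col), s], axis=1)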
Edit: Generally, shifting is possible with df[2].shift(1), as already posted; however, that would cut off the carry-over (the last value is pushed past the end of the column).
If you don't want to lose the values you shift past the end of your dataframe, simply append the required number of empty rows first:
offset = 5
padding = pd.DataFrame(np.nan, index=range(offset), columns=DF.columns)  # empty rows
DF = pd.concat([DF, padding], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
DF = DF.shift(periods=offset)
DF = DF.reset_index(drop=True)  # only needed if the index isn't already sequential
This assumes the following imports:
import pandas as pd
import numpy as np
First append a new row of NaNs at the end of the DataFrame (df):
s1 = df.iloc[0].copy()  # copy the 1st row to a new Series s1
s1[:] = np.nan          # set all values to NaN
df2 = pd.concat([df, s1.to_frame().T], ignore_index=True)  # add s1 to the end of df; DataFrame.append was removed in pandas 2.0
It will create a new DataFrame df2. Maybe there is a more elegant way, but this works.
Now you can shift it:
df2.x2 = df2.x2.shift(1)  # shift the column you want
While trying to solve a problem of my own, similar to yours, I found something in the pandas docs that I think answers this question:
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
I hope this helps future readers with this question.
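As a minimal sketch of that freq behaviour (made-up daily data):
import pandas as pd

s = pd.Series([1, 2, 3], index=pd.date_range("2022-01-01", periods=3, freq="D"))
print(s.shift(1))            # data moves down one row, first value becomes NaN
print(s.shift(1, freq="D"))  # index labels move forward one day, data unchanged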
You can also shift a column upwards by passing a negative period. Given a DataFrame df3:
1 108.210 108.231
2 108.231 108.156
3 108.156 108.196
4 108.196 108.074
... ... ...
2495 108.351 108.279
2496 108.279 108.669
2497 108.669 108.687
2498 108.687 108.915
2499 108.915 108.852
df3['yo'] = df3['yo'].shift(-1)
yo price
0 108.231 108.210
1 108.156 108.231
2 108.196 108.156
3 108.074 108.196
4 108.104 108.074
... ... ...
2495 108.669 108.279
2496 108.687 108.669
2497 108.915 108.687
2498 108.852 108.915
2499 NaN 108.852
This is how I do it:
df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, closed='right'))  # in pandas >= 1.4 use inclusive='right'
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)
Basically I am generating an empty dataframe with the desired index and then just concatenating the two together. But I would really like to see this as a standard feature in pandas, so I have proposed an enhancement to pandas.
I'm new to pandas, and I may not be understanding the question, but this solution worked for my problem:
# Shift contents of column 'x2' down 1 row
df['x2'] = df['x2'].shift()
Or, to create a new column with the contents of 'x2' shifted down 1 row:
# Create new column with contents of 'x2' shifted down 1 row
df['x3'] = df['x2'].shift()
I had a read of the official docs for shift() while trying to figure this out, but it doesn't make much sense to me, and has no examples referencing this specific behavior.
Note that the last row of column 'x2' is effectively pushed off the end of the DataFrame. I expected shift() to have a flag to change this behaviour, but I couldn't find anything.
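If you do want to keep the value that would be pushed off, one workaround (a sketch, not a built-in flag; it assumes a default RangeIndex) is to add an empty row before shifting:
df = df.reindex(range(len(df) + 1))  # one extra empty row at the bottom
df['x2'] = df['x2'].shift()          # the last value now lands in the new row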
The reason I am loading the df from the .csv is that another file creates the csv and then this file accesses it (maybe this is an issue? I'm not sure).
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('MAIN_DATAFRAME.csv')
def plot_graph_1(MAIN_DATAFRAME):
    df1 = MAIN_DATAFRAME.loc[['Bots']]
    df1 = df1.transpose()
    df2 = MAIN_DATAFRAME.loc[['Speed']]
    df2 = df2.transpose()
    df3 = MAIN_DATAFRAME.loc[['Weight']]
    df3 = df3.transpose()
    df4 = MAIN_DATAFRAME.loc[['Chargers']]
    df4 = df4.transpose()
    ax = df1.plot(kind='bar')
    df2.plot(ax=ax, kind='bar')
    df3.plot(ax=ax, kind='bar')
    df4.plot(ax=ax, kind='bar')
    ax.bar(ax, df1)
    plt.show()
plot_graph_1(df)
So I would like to have this DataFrame plotted, and ideally the bar charts will share an axis and be different colors so that they can be distinguished when stacked on each other.
btw here is the dataframe:
          Run 1  Run 2  Run 3  Run 4  Run 5  Run 6  Run 7  Run 8  Run 9  Run 10
Bots          5      6      7      8      9     10     11     12     13      14
Speed      1791   2359   2996   3593   4105   4551   4631   4656   4672    4674
Weight      612    733    810    888    978   1059   1079   1085   1090    1092
Chargers     10     10     10     10     10     10     10     10     10      10
I tried changing how I access the dataframe values. I also tried changing the brackets from df2 = MAIN_DATAFRAME.loc[['Speed']] to df2 = MAIN_DATAFRAME.loc['Speed'], and I still get a KeyError.
You can transpose the whole DataFrame and then you can plot it like this:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
# Read data from CSV
df = pd.read_csv(
    "3.csv",
    index_col=0
)

# Define plotting function
def plot_bars_from_df(df: pd.DataFrame) -> plt.Axes:
    """Plot bar chart from DataFrame."""
    df = df.transpose()
    ax = df.plot(
        kind="bar"
    )
    return ax

# Call function
plot_bars_from_df(df)
You'll get the following output:
However, "Bots" and "Chargers" are a few orders of magnitude smaller than the other rows, so it doesn't make much sense to plot them on a shared axis.
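If you still want a single figure despite that, one option (a sketch, not the only way) is to give each row its own subplot so the scales don't clash:
# One subplot per variable so each keeps a readable scale
axes = df.transpose().plot(kind="bar", subplots=True, layout=(2, 2), figsize=(10, 6), legend=False)
plt.tight_layout()
plt.show()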
I'd like to compute the correlation of the variable "hours" between two groups in a panel data. Specifically, I'd like to compute the correlation of hours between groups A and B with group C. So the end result would contain two numbers: corr(hours_A, hours_C), and corr(hours_B, hours_C).
I have tried:
data.groupby('group').corr()
But it gave me the correlation between "hours" and "other variables" within each group, but I want the correlation of just the "hours" variable across two groups. I'm new to Python, so any help is welcome!
group  year  hours  other variables
A      2000   2784              567
A      2001   2724              567
A      2002   2715              567
B      2000   2301              567
B      2001   2612              567
B      2002   2489              567
C      2000   2190              567
C      2001   2139              567
C      2002   2159              567
Update:
Thank you for answering my question!
I eventually figured out some code of my own, but my code is not as elegant as the answers provided. For what it's worth, I'm posting it here.
df = df.set_index(['group','year'])
df = df.unstack(level=0)
df.index = pd.to_datetime(df.index.astype(str), format='%Y').year  # astype(str) so the years parse as years, not epoch timestamps
df.columns = df.columns.rename(['variables',"group"])
df.xs('hours', level="variables", axis=1).corr()
Indexing year isn't necessary for the correlation, but if I want to create cross sections of the data later, it might come in handy.
Maybe it is not the best way to do it, but I believe this will get you on your way.
import pandas as pd

data = data[['group', 'year', 'hours']]  # keep only the needed columns
data_new = data.set_index(['year', 'group']).unstack('group')  # one column per group
final_df = pd.DataFrame(data_new.to_numpy(), columns=['A', 'B', 'C'])
final_df.corr()
I will also leave the steps to (I think) reproduce your problem, for anyone who wishes to give it a try!
import pandas as pd
data_str = '''A|2000|2784|567
A|2001|2724|567
A|2002|2715|567
B|2000|2301|567
B|2001|2612|567
B|2002|2489|567
C|2000|2190|567
C|2001|2139|567
C|2002|2159|567'''.split('\n')
data = pd.DataFrame([x.split('|') for x in data_str], columns=['group', 'year', 'hours', 'other_variables'])
data['hours'] = data['hours'].astype(int)
You can apply list to the groups and then convert to Series, transpose, and then call corr() on the data.
from io import StringIO
import pandas as pd
>>> data = StringIO("""group,year,hours,other variables
A,2000,2784,567
A,2001,2724,567
A,2002,2715,567
B,2000,2301,567
B,2001,2612,567
B,2002,2489,567
C,2000,2190,567
C,2001,2139,567
C,2002,2159,567""")
>>> df = pd.read_csv(data)
>>> df.groupby('group')['hours'].apply(list).apply(pd.Series).T.corr()
group         A         B         C
group
A      1.000000 -0.865940  0.867835
B     -0.865940  1.000000 -0.999993
C      0.867835 -0.999993  1.000000
How does this work? The groupby + apply(list) produces the following, which is a Series with three rows, each being a list of three items.
A [2784, 2724, 2715]
B [2301, 2612, 2489]
C [2190, 2139, 2159]
The apply(pd.Series) converts the list in each row to a Series. You then have to transpose with the T operator to get the data for each group in a single column.
0 1 2
group
A 2784 2724 2715
B 2301 2612 2489
C 2190 2139 2159
Transposed, it is:
group A B C
0 2784 2301 2190
1 2724 2612 2139
2 2715 2489 2159
If you only want the two values, it would be
>>> df.groupby('group')['hours'].apply(list).apply(pd.Series).T.corr().iloc[0:2, 2].values
array([ 0.86783525, -0.99999277])
In this example, you use iloc to get the first and second rows (groups A and B) of the third column (group C; python indexes are zero-based), and then the values property of a Series to return an array rather than a Series.
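A sketch of a more direct alternative (same df as above): a pivot keeps the group labels, so the correlations with C can be read off directly.
>>> wide = df.pivot(index='year', columns='group', values='hours')
>>> wide.corr()['C'].loc[['A', 'B']]
group
A    0.867835
B   -0.999993
Name: C, dtype: float64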
I am trying to solve one of Coursera's homework assignments for beginners.
I have read the data and tried to convert it as shown in the code below. I am looking for the frequency distribution of the variables in question, and for this reason I am trying to round the values. I tried several methods, but nothing gives me what I am expecting (see below, please).
import pandas as pd
import numpy as np
# loading the database file
data = pd.read_csv('gapminder-2.csv',low_memory=False)
# number of observations (rows)
print len(data)
# number of variables (columns)
print len(data.columns)
sub1 = pd.DataFrame({'income': data['incomeperperson'].convert_objects(convert_numeric=True),
                     'alcohol': data['alcconsumption'].convert_objects(convert_numeric=True),
                     'suicide': data['suicideper100th'].convert_objects(convert_numeric=True)})
sub1.apply(pd.Series.round)
income = sub1['income'].value_counts(sort=False)
print income
However, I got
285.224449 1
2712.517199 1
21943.339898 1
1036.830725 1
557.947513 1
What I expect:
285 1
2712 1
21943 1
1036 1
557 1
You can use Series.round():
ser = pd.Series([1.1,2.1,3.1,5.1])
print(ser)
0 1.1
1 2.1
2 3.1
3 5.1
dtype: float64
From here you can use .round(); the default number of decimal places is 0 per the docs.
print(ser.round())
0    1.0
1    2.0
2    3.0
3    5.0
dtype: float64
To save the changes you need to re-assign it: ser = ser.round().
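Applied to the question's sub1 DataFrame, a minimal sketch (assuming all three columns are numeric) would be:
sub1 = sub1.round()  # DataFrame.round() rounds every column, 0 decimals by default
income = sub1['income'].value_counts(sort=False)
print(income)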
This question is about ggplot2 in rpy2, but I will accept answers in R, if that is easier for you. I'll happily translate the solution afterwards.
I have a rpy2 dataframe that looks like the following (full file here):
times_seen nb_timepoints main_island other_timepoint
1 0 2 0 0
2 346 2 0 3
3 572 2 0 6
4 210 2 0 9
5 182 2 0 12
6 186 2 0 18
7 212 2 0 21
8 346 2 3 0
...
For each main_island and nb_timepoints I want to plot all the other_timepoints with their respective values.
I have the following code:
import rpy2.robjects.lib.ggplot2 as ggplot2
import rpy2.robjects as ro
p = ggplot2.ggplot(rpy2_df) + ggplot2.aes_string(y='times_seen', x='other_timepoint') + ggplot2.geom_bar(stat="identity")
I'd like to get something like what is shown in the image below. How do I achieve that?
P.S. I've attached a file that shows approximately what I want to achieve (only that I want grids, labels, axes, etc.).
Add the following line to your plot:
p + ggplot2.facet_grid(ro.Formula('nb_timepoints ~ main_island'))
This produces a faceted graph with one panel per nb_timepoints / main_island combination.
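Putting the pieces together, a minimal sketch (rpy2_df is the dataframe from the question):
import rpy2.robjects as ro
import rpy2.robjects.lib.ggplot2 as ggplot2

p = (ggplot2.ggplot(rpy2_df)
     + ggplot2.aes_string(x='other_timepoint', y='times_seen')
     + ggplot2.geom_bar(stat="identity")
     + ggplot2.facet_grid(ro.Formula('nb_timepoints ~ main_island')))
p.plot()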