This question is about ggplot2 in rpy2, but I will accept answers in R, if that is easier for you. I'll happily translate the solution afterwards.
I have an rpy2 dataframe that looks like the following (full file here):
times_seen nb_timepoints main_island other_timepoint
1 0 2 0 0
2 346 2 0 3
3 572 2 0 6
4 210 2 0 9
5 182 2 0 12
6 186 2 0 18
7 212 2 0 21
8 346 2 3 0
...
For each main_island and nb_timepoints I want to plot all the other_timepoints with their respective values.
I have the following code:
import rpy2.robjects.lib.ggplot2 as ggplot2
import rpy2.robjects as ro
p = ggplot2.ggplot(rpy2_df) + ggplot2.aes_string(y='times_seen', x='other_timepoint') + ggplot2.geom_bar(stat="identity")
I'd like to get something like what is shown in the image below. How do I achieve that?
PS. I've attached a file that shows approximately what I want to achieve (only that I want grids, labels, axes, etc.).
Add the following line to your plot:
p + ggplot2.facet_grid(ro.Formula('nb_timepoints ~ main_island'))
This produces a grid of bar charts, one panel per combination of nb_timepoints (rows) and main_island (columns).
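For completeness, a minimal sketch combining the original snippet with the facet line (assuming rpy2_df is the data frame from the question, and that an R graphics device is available for p.plot() to render into):
import rpy2.robjects.lib.ggplot2 as ggplot2
import rpy2.robjects as ro

p = (ggplot2.ggplot(rpy2_df)
     + ggplot2.aes_string(x='other_timepoint', y='times_seen')
     + ggplot2.geom_bar(stat="identity")
     + ggplot2.facet_grid(ro.Formula('nb_timepoints ~ main_island')))
p.plot()  # equivalent to print()-ing the ggplot object in R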
I suppose someone might have asked this already, but for the life of me I cannot find what I need after some searching; possibly my level of Python is too low.
I saw several questions with answers using globals() and exec(), with comments that it's a bad idea; other answers suggest using dictionaries or lists. At this point I got a bit lost about what to use here, and any help would be very welcome.
What I need is roughly this:
I have a Python DataFrame, say called dftest
I'd like to split dftest into say 6 parts of similar size
then I'd like to iterate over them (or possibly parallelise this?) and run some steps that call spatial functions with parameters (param0, param1, ... param5) over each row of each df to add more columns, preferably exporting each result to a CSV (since one part takes a long time to complete, I wouldn't want to lose the result of an iteration)
And then I'd like to put them back together into one DataFrame, say dfresult (possibly with concat) and continue doing the next thing with it
To keep it simple, this is what a toy dftest looks like (the original df has more rows and columns):
print(dftest)
# rowid type lon year
# 1 1 Tomt NaN 2021
# 2 2 Lägenhet 12.72 2022
# 3 3 Lägenhet NaN 2017
# 4 4 Villa 17.95 2016
# 5 5 Radhus 17.95 2021
# 6 6 Villa 17.95 2016
# 7 7 Fritidshus 18.64 2020
# 8 8 Villa 18.64 2019
# 9 9 Villa 18.63 2021
# 10 10 Villa 18.63 2019
# 11 11 Villa 17.66 2017
# 12 12 Radhus 17.66 2022
So here is what I tried:
dfs = np.array_split(dftest, 6)
for j in range(0, 6):
    print(f'dfs[{j}] has', len(dfs[j].index), 'obs', min(dfs[j].index), 'to', max(dfs[j].index))
where I get output:
# dfs[0] has 2 obs 1 to 2
# dfs[1] has 2 obs 3 to 4
# dfs[2] has 2 obs 5 to 6
# dfs[3] has 2 obs 7 to 8
# dfs[4] has 2 obs 9 to 10
# dfs[5] has 2 obs 11 to 12
So now I'd like to iterate over each df and create more columns. I tried a hardcoded test, one by one something like this:
from tqdm import tqdm

for row in tqdm(dfs[0].itertuples()):
    x = row.type
    y = foo.bar(x, param="param0")
    i = row[0]                 # row.Index, the original dataframe index
    dfs[0].loc[i, 'anotherColumn'] = baz(y)
    # ... some more functions ...
dfs[0].to_csv("/projectPath/dfs0.csv")
I suppose this should be possible to automate or even run in parallel (how?)
And in the end I'll try putting them together (no clue if this would work), possibly something like this:
pd.concat([dfs[0],dfs[1],dfs[2],dfs[3],dfs[4],dfs[5]])
If I had 100 parts, perhaps something like dfs[0]:dfs[5] would work... but I'm still stuck on the previous step.
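A minimal sketch of that last step: pd.concat accepts a list of any length, so the parts never have to be spelled out one by one.
# works the same for 6 parts or 100 parts
dfresult = pd.concat(dfs)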
PS. I'm using a Jupyter notebook on localhost with Python3.
As far as I understand, you can use the chunk_apply function of the parallel-pandas library. This function splits the dataframe into chunks, applies a custom function to each chunk, and then concatenates the results. Everything runs in parallel. Toy example:
# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
# n_cpu - count of cores and split chunks
ParallelPandas.initialize(n_cpu=8)

def foo(df):
    # do something with df
    df['new_col'] = df.sum(axis=1)
    return df

if __name__ == '__main__':
    ROW = 10000
    COL = 10
    df = pd.DataFrame(np.random.random((ROW, COL)))
    res = df.chunk_apply(foo, axis=0)
    print(res.head())
Out:
0 1 2 ... 8 9 new_col
0 0.735248 0.393912 0.966608 ... 0.261675 0.207216 6.276589
1 0.256962 0.461601 0.341175 ... 0.688134 0.607418 5.297881
2 0.335974 0.093897 0.622115 ... 0.442783 0.115127 3.102827
3 0.488585 0.709927 0.209429 ... 0.942065 0.126600 4.367873
4 0.619996 0.704085 0.685806 ... 0.626539 0.145320 4.901926
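To map this onto the workflow in the question, the per-chunk function could also write each chunk to its own CSV before returning it. A hedged sketch (the column logic is a hypothetical placeholder for the real spatial functions, and naming the file after chunk.index.min() is just one way to label the chunks):
def process_chunk(chunk):
    chunk = chunk.copy()
    # placeholder for the real per-row spatial functions
    chunk['anotherColumn'] = chunk['type'].astype(str) + '_processed'
    # save the intermediate result so a crash does not lose finished chunks
    chunk.to_csv(f"/projectPath/chunk_{chunk.index.min()}.csv")
    return chunk

# requires ParallelPandas.initialize(...) to have been called as above
dfresult = dftest.chunk_apply(process_chunk, axis=0)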
I am using an SQLite database with Pandas and want to display the dynamic data using Bokeh (varea_stack)
My dynamic data (df) structure looks like this:
id date site numberOfSessions ... avgSessionDuration uniqueDimensionCombinations events pageViews
0 1 2020-07-29 177777770 3 ... 11.00 2 4 3
1 2 2020-07-29 178888883 1 ... 11.00 1 4 3
2 3 2020-07-29 177777770 1 ... 11.00 1 4 3
3 4 2020-07-29 173333333 2 ... 260.50 2 23 10
4 5 2020-07-29 178888883 2 ... 260.50 2 23 10
5 6 2020-07-29 173333333 2 ... 260.50 2 23 10
6 7 2020-07-29 178888883 12 ... 103.75 12 143 36
7 8 2020-07-30 178376403 12 ... 103.75 12 143 36
8 9 2020-07-30 178376403 12 ... 103.75 12 143 36
9 10 2020-07-28 178376403 12 ... 103.75 12 143 36
I would like to create a varea_stack plot where the:
x-axis -> "date"
y-axis -> "numberOfSessions" stacked according to "site"
(I am thinking maybe using some sort of Pivot Table?)
this is what I have:
from bokeh.plotting import figure, output_file, show
from bokeh.embed import components
from bokeh.models import HoverTool
plot = figure()
plot.varea_stack(df.site.unique().tolist(), x=df.index.values.tolist(), source=df)
script, div = components(plot)
the Error I get:
Keyword argument sequences for broadcasting must be the same length as stackers
I have been searching online (https://docs.bokeh.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure.varea_stack) and through Stack Overflow. I can't seem to find an answer.
I can't really speak to the Pandas operations needed, but this is the general format the data needs to be in for varea_stack:
sites = [<list of sites>]
data = {
'date' : <all the datetime values>,
<site1> : <site1 values for every date>,
<site2> : <site2 values for every date>,
<site3> : <site3 values for every date>,
...
}
plot.varea_stack(sites, x='date', source=data)
Note that to be usable by varea_stack the following must be true:
every item in the sites list has to be a column in the data
every sites column has to be the same length (a value for every date)
Note that the above also assumes the dates are converted to real datetime values. If your dates are categoricals (i.e. not real datetimes on a continuous datetime axis), then you will need to pass the list of date strings to the x_range of figure as well (as with any categorical axis).
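For the pandas side hinted at in the question, a pivot along these lines should produce that shape. This is only a sketch, not tested against the real data: duplicates per date/site pair are summed here, and the site labels are cast to strings since Bokeh data-source column names are expected to be strings.
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

# one row per date, one column per site, values = numberOfSessions
wide = df.pivot_table(index='date', columns='site',
                      values='numberOfSessions', aggfunc='sum').fillna(0)
wide.columns = [str(c) for c in wide.columns]
wide = wide.reset_index()
wide['date'] = pd.to_datetime(wide['date'])  # real datetimes, not strings

sites = [c for c in wide.columns if c != 'date']
plot = figure(x_axis_type='datetime')
plot.varea_stack(sites, x='date', source=ColumnDataSource(wide))
show(plot)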
I have a problem regarding how to plot multi-indexed data in a single bar chart. I started with a DataFrame with three columns (artist, genre and miscl_count) and 195 rows. I then grouped the data by two of the columns, which resulted in the table below. My question is: how can I create a bar plot from this, so that each group in "miscl_count" is shown as a separate bar across all five genres (i.e. a total of 3x5 bars)? I would also like the genre to determine what color a bar is assigned.
I know that there is unstacking, but I don't understand how I can get this to work with Matplotlib or Seaborn.
The head of the DataFrame, that I perform the groupby method on looks like this:
print(miscl_df.head())
artist miscl_count genre
0 band1 5 a
1 band2 6 b
2 band3 5 b
3 band4 4 b
4 band5 5 b
5 band6 5 c
miscl_df_group = miscl_df.groupby(['genre', 'miscl_count']).count()
print(miscl_df_group)
After group by, the output looks like this:
artist
miscl_count 4 5 6
genre
a 11 9 9
b 19 13 16
c 13 14 16
d 10 9 12
e 21 14 10
Just to make sure I made myself clear, the output should be shown as a single chart (and not as subplots)!
Working solution to be used on the grouped data:
miscl_df_group.unstack(level='genre').plot(kind='bar')
Alternatively, it can also be used this way:
miscl_df_group.unstack(level='miscl_count').plot(kind='bar')
With seaborn there is no need to group the data; this is done under the hood:
import seaborn as sns
sns.barplot(x="artist", y="miscl_count", hue="genre", data=miscl_df)
(change the column names at will, depending on what you want)
# full working example
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame()
df["artist"] = list(map(lambda i: f"band{i}", np.random.randint(1,4,size=(100,))))
df["genre"] = list(map(lambda i: f"genre{i}", np.random.randint(1,6,size=(100,))))
df["count"] = np.random.randint(50,100,size=(100,))
# df
# count genre artist
# 0 97 genre9 band1
# 1 95 genre7 band1
# 2 65 genre3 band2
# 3 81 genre1 band1
# 4 58 genre10 band1
# .. ... ... ...
# 95 61 genre1 band2
# 96 53 genre9 band2
# 97 55 genre9 band1
# 98 94 genre1 band2
# 99 85 genre8 band1
# [100 rows x 3 columns]
sns.barplot(x="artist", y="count", hue="genre", data=df)
In the first 5 rows of the result below, you can see the Freq column and the rolling-means (window = 3) column MMeans, calculated using pandas:
Freq MMeans
0 215 NaN
1 453 NaN
2 277 315.000000
3 38 256.000000
4 1 105.333333
I was expecting MMeans to start at index 1, since 1 is the mean of the indices (0, 1, 2). Is there an option that I am missing with the rolling method?
edit 1
print(pd.DataFrame({
    'Freq': eff,
    'MMeans': dF['Freq'].rolling(3).mean()}))
edit 2
Sorry @Yuca for not being as clear as I'd like. Here are the columns I'd like pandas to return:
Freq MMeans
0 215 NaN
1 453 315.000000
2 277 256.000000
3 38 105.333333
4 1 29.666667
which are not the results returned with min_periods=2
Use min_periods=1:
df['rol_mean'] = df['Freq'].rolling(3,min_periods=1).mean()
output:
Freq MMeans rol_mean
0 215 NaN 215.000000
1 453 NaN 334.000000
2 277 315.000000 315.000000
3 38 256.000000 256.000000
4 1 105.333333 105.333333
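As a side note, the numbers requested in edit 2 look like a window centred on each row rather than one that ends at it; if that is the intent, rolling supports it directly (a sketch):
# value at index i becomes the mean of rows i-1, i and i+1
df['MMeans'] = df['Freq'].rolling(3, center=True).mean()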
I am trying to solve one of the Coursera homework assignments for beginners.
I have read the data and tried to convert it as shown in the code below. I am looking for the frequency distribution of the considered variables, and for that reason I am trying to round the values. I tried several methods but nothing gives me what I am expecting (see below, please).
import pandas as pd
import numpy as np
# loading the database file
data = pd.read_csv('gapminder-2.csv',low_memory=False)
# number of observations (rows)
print len(data)
# number of variables (columns)
print len(data.columns)
sub1 = pd.DataFrame({'income':data['incomeperperson'].convert_objects(convert_numeric=True),
'alcohol':data['alcconsumption'].convert_objects(convert_numeric=True),
'suicide':data['suicideper100th'].convert_objects(convert_numeric=True)})
sub1.apply(pd.Series.round)
income = sub1['income'].value_counts(sort=False)
print income
However, I got
285.224449 1
2712.517199 1
21943.339898 1
1036.830725 1
557.947513 1
What I expect:
285 1
2712 1
21943 1
1036 1
557 1
You can use Series.round():
ser = pd.Series([1.1,2.1,3.1,5.1])
print(ser)
0 1.1
1 2.1
2 3.1
3 5.1
dtype: float64
From here you can use .round(); the default number of decimals is 0, per the docs.
print(ser.round())
0 1.0
1 2.0
2 3.0
3 5.0
dtype: float64
To save the changes, you need to re-assign it: ser = ser.round().
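Applied to the question's sub1 frame, the re-assignment could look roughly like this (a sketch; pd.to_numeric replaces convert_objects, which is deprecated in newer pandas):
sub1 = pd.DataFrame({
    'income': pd.to_numeric(data['incomeperperson'], errors='coerce'),
    'alcohol': pd.to_numeric(data['alcconsumption'], errors='coerce'),
    'suicide': pd.to_numeric(data['suicideper100th'], errors='coerce')})

sub1 = sub1.round()  # keep the rounded values by re-assigning
income = sub1['income'].value_counts(sort=False)
print(income)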