I am trying to output a dataframe and a chart for each of my 'vals'. I'm struggling to piece together some of these Pythonic basics.
Flow: I take the dataframe, do a groupby, get the percentage of total, and output a table and a chart. However, I want to loop through this process, the first time with the dataframe filtered on Reviewed?=='Yes', and then on 'No'.
data = {'Region': ["US", "US", "US","US"],
'Gender': ["M","F","F","M"],
'Reviewed?': ["Yes","Yes","No","No"]}
df = pd.DataFrame(data, columns=['Region','Gender','Reviewed?'])
def func(df):
    vals = ['Yes','No']
    for i in range(len(vals)):
        for x in vals:
            gb[i] = df[df['Reviewed?']==x].groupby(['Gender'])['Region'].count().reset_index()
            total[i] = gb[i]['Region'].sum()
            gb[i]['Percentage'] = (gb[i]['Region'] / total[i])
            gb[i] = gb[i].sort_values(by='Percentage', ascending=False)
            sns.barplot(data=gb[i], x='Region', y='Percentage')
            plt.show()
    return gb[i]
A few of the error messages:
ValueError: could not broadcast input array from shape (0,2) into shape (0)
ValueError: cannot copy sequence with size 2 to array axis with dimension 0
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
Update
Here is a brute-force version of what I want. I just want a more efficient and dynamic way to do this.
Note: I wasn't originally explicit that I want to keep the counts in the final dataframe.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.ylim and plt.show

data = {'Region': ["US", "US", "US", "US"],
        'Gender': ["M", "F", "F", "M"],
        'Reviewed?': ["Yes", "Yes", "No", "No"]}
df = pd.DataFrame(data, columns=['Region', 'Gender', 'Reviewed?'])

def func(df):
    gb = df[df['Reviewed?']=='No'].groupby(['Gender'])['Region'].count().reset_index()
    total = gb['Region'].sum()
    gb['Percentage'] = gb['Region'] / total
    notyetreviewed = gb.sort_values(by='Percentage', ascending=False)
    sns.barplot(data=notyetreviewed, x='Gender', y='Percentage')
    plt.ylim(0, 1)
    plt.show()

    gb = df[df['Reviewed?']=='Yes'].groupby(['Gender'])['Region'].count().reset_index()
    total = gb['Region'].sum()
    gb['Percentage'] = gb['Region'] / total
    reviewed = gb.sort_values(by='Percentage', ascending=False)
    plt.ylim(0, 1)
    sns.barplot(data=reviewed, x='Gender', y='Percentage')
    plt.show()
    return notyetreviewed, reviewed
func(df)
You can try something like this:
import pandas as pd
data = {'Region': ["US", "US", "US","US"],
'Gender': ["M","F","F","M"],
'Reviewed?': ["Yes","Yes","No","No"]}
df = pd.DataFrame(data, columns=['Region','Gender','Reviewed?'])
import matplotlib.pyplot as plt

for outcome in ['Yes', 'No']:
    filtered = df[df['Reviewed?'].eq(outcome)]['Gender'].value_counts(normalize=True)
    filtered.plot.bar()
    plt.show()  # render each outcome's chart on its own figure
In this case, I'm filtering the DF on each loop by the Reviewed? outcome and then getting the proportional values for male and female. Your question poses a binary choice, but it could be extended with for outcome in df['Reviewed?'].unique(): (sketched below).
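Here is a minimal sketch of that extension which also keeps the raw counts alongside the percentages, per the update above (the summary frame is my addition, not part of the original snippet; pandas and matplotlib.pyplot are imported as above):

for outcome in df['Reviewed?'].unique():
    # raw counts per gender for this outcome, plus each count's share of the total
    counts = df[df['Reviewed?'].eq(outcome)]['Gender'].value_counts()
    summary = counts.to_frame('Count')
    summary['Percentage'] = counts / counts.sum()
    print(summary)
    summary['Percentage'].plot.bar(title=str(outcome))
    plt.show()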
This is a marginal improvement. It would be nice to see a more Pythonic solution that wouldn't require me to hard-code 'Reviewed?' into the function...
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.ylim and plt.show

data = {'Region': ["US", "US", "US", "US"],
        'Gender': ["M", "F", "F", "M"],
        'Reviewed?': ["Yes", "Yes", "No", "No"]}
df = pd.DataFrame(data, columns=['Region', 'Gender', 'Reviewed?'])

def func(df, group, reviewed):
    df = df[df['Reviewed?'].isin(reviewed)].groupby([group])['Region'].count().reset_index()
    df['Percentage'] = df['Region'] / df['Region'].sum()
    sns.barplot(data=df, x=group, y='Percentage')  # x=group rather than a hard-coded 'Gender'
    plt.ylim(0, 1)
    plt.show()
    return df

df1 = func(df, 'Gender', ['Yes'])
df2 = func(df, 'Gender', ['No'])
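To drop the hard-coded 'Reviewed?' entirely, the filter column can be passed in as well and the calls driven off that column's unique values. A minimal sketch building on the function above (func2, filter_col, and results are illustrative names of mine):

def func2(df, group, filter_col, val):
    out = df[df[filter_col].eq(val)].groupby([group])['Region'].count().reset_index()
    out['Percentage'] = out['Region'] / out['Region'].sum()
    sns.barplot(data=out, x=group, y='Percentage')
    plt.ylim(0, 1)
    plt.show()
    return out

# one result frame per distinct value of the filter column
results = {val: func2(df, 'Gender', 'Reviewed?', val) for val in df['Reviewed?'].unique()}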
Related
I've got a weird question for a class project. Assuming X ~ Exp(Lambda), Lambda = 1.6, I have to generate 100 samples of X, with the indices corresponding to the sample size of each generated sample (S1, S2 ... S100). I've worked out a simple loop which generates the required samples in an array, but I am not able to rename the array.
First attempt:
import numpy as np
import matplotlib.pyplot as plt
samples = []
for i in range(1, 101, 1):
    samples.append(np.random.exponential(scale=1/1.6, size=i))
Second attempt:
import numpy as np
import matplotlib.pyplot as plt
for i in range(1, 101, 1):
    samples = np.random.exponential(scale=1/1.2, size=i)
    col = f'samples {i}'
    df_samples[col] = exponential_sample
df_samples = pd.DataFrame(samples)
An example how I would like to visualize the data:
# drawing 50 random samples of size 2 from the exponentially distributed population
sample_size = 2
rate = 1.6  # assuming rate is the Lambda given above; it was undefined in the original snippet
df2 = pd.DataFrame(index=['x1', 'x2'])
for i in range(1, 51):
    exponential_sample = np.random.exponential((1/rate), sample_size)
    col = f'sample {i}'
    df2[col] = exponential_sample
# Taking a peek at the samples
df2
But instead of having a fixed sample_size = 2, I would like to have sample size = i. This way, I will be able to generate 1 row for the first column (S1), 2 rows for the second column (S2), and so on until I reach 100 rows for the 100th column (S100).
You cannot easily stick vectors of different lengths into a df, so your mock-up code would not work, but you can concat one vector at a time:

import numpy as np
import pandas as pd

df = pd.DataFrame()
for i in range(100, 10100, 100):
    tmp = pd.DataFrame({f'S{i}': np.random.exponential(scale=1/1.2, size=i)})
    df = pd.concat([df, tmp], axis=1)
Use a dict instead, maybe?

samples = {}
for i in range(100, 10100, 100):
    samples[i] = np.random.exponential(scale=1/1.2, size=i)

Then you can convert it into a pandas DataFrame if you like, for instance:
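A minimal sketch of that conversion, switched to the sizes 1..100 from the question; wrapping each array in a Series lets pandas pad the shorter columns with NaN (the S-prefixed keys are illustrative):

import numpy as np
import pandas as pd

samples = {f'S{i}': np.random.exponential(scale=1/1.6, size=i) for i in range(1, 101)}
df_samples = pd.DataFrame({k: pd.Series(v) for k, v in samples.items()})
print(df_samples.shape)  # (100, 100): column S1 holds 1 value, S100 holds 100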
I have a data frame like the following:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import csv
import seaborn as sns
import warnings

df_input = pd.read_csv('combine_input.csv', delimiter=',')
df_output = pd.read_csv('combine_output.csv', delimiter=',')
In this data frame there are many repeated rows; for example, the first row is repeated more than 1000 times, and so on for the other rows. When I plot the time distribution, I get the figure below, which shows the frequency of the time parameter:
df_input.plot(y='time',kind = 'hist',figsize=(10,10))
plt.grid()
plt.show()
My question is: how can I take only the data inside the red rectangle, for example at time = 0.006 and frequency = 0.75 × 1e6 (see the following pic)?
Note: in place of target you have to write time, since your column name is time, or else rename the column to target.
def calRows(df, x, y):
    # restrict to the part of the dataframe under consideration
    df1 = pd.DataFrame(df.target[df.target <= x])
    minCount = len(df1)
    targets = df1.target.unique()
    # find the smallest per-target row count
    for i in targets:
        count = int(df1[df1.target == i].count())
        if minCount > count:
            minCount = count
    # cap it at the y-intercept
    if minCount > y:
        minCount = int(y)
    return minCount
You have to pass your data frame, the x-intercept of the graph, and the y-intercept of the graph to the calRows(df, x, y) function, which will return the number of rows to take for each target.

rows = calRows(df, 6, 75)
print(rows)
The takeFeatures(df, rows, x) function takes the dataframe, rows (the result of the first function), and the x-intercept of the graph, and returns the final dataframe.
def takeFeatures(df, rows, x):
    finalDf = pd.DataFrame(columns=df.columns)
    df1 = df[df.target <= x]
    targets = df1.target.unique()
    for i in targets:
        targeti = df1[df1.target == i]
        sample = targeti.sample(rows)
        finalDf = pd.concat([finalDf, sample])
    return finalDf
Calling the takeFeatures() function:

final = takeFeatures(df, rows, 6)
print(final)

Your final dataframe will have the values you expected from the graph, and after plotting this final dataframe you will get a graph like the one shown.
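For comparison, a more compact sketch of the same idea using a boolean mask plus a per-group sample. Note it is not identical to calRows: it caps each distinct value at y rows instead of using the global minimum count (take_features and the time column name are assumptions on my part):

def take_features(df, x, y):
    # keep rows whose time falls inside the rectangle's x-range,
    # then sample at most y rows for each distinct time value
    subset = df[df['time'] <= x]
    return subset.groupby('time', group_keys=False).apply(
        lambda g: g.sample(min(len(g), y)))

final = take_features(df_input, 0.006, 750000)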
I'm trying to calculate a rolling mean, max, min, and std for specific columns inside a time series pandas dataframe. But I keep getting NaN for the lagged values and I'm not sure how to fix it. My MWE is:
import numpy as np
import pandas as pd
# original data
df = pd.DataFrame()
np.random.seed(0)
days = pd.date_range(start='2015-01-01', end='2015-05-01', freq='1D')
df = pd.DataFrame({'Date': days, 'col1': np.random.randn(len(days)), 'col2': 20+np.random.randn(len(days)), 'col3': 50+np.random.randn(len(days))})
df = df.set_index('Date')
print(df.head(10))
def add_lag(dfObj, window):
    cols = ['col2', 'col3']
    for col in cols:
        rolled = dfObj[col].rolling(window)
        lag_mean = rolled.mean().reset_index()#.astype(np.float16)
        lag_max = rolled.max().reset_index()#.astype(np.float16)
        lag_min = rolled.min().reset_index()#.astype(np.float16)
        lag_std = rolled.std().reset_index()#.astype(np.float16)
        dfObj[f'{col}_mean_lag{window}'] = lag_mean[col]
        dfObj[f'{col}_max_lag{window}'] = lag_max[col]
        dfObj[f'{col}_min_lag{window}'] = lag_min[col]
        dfObj[f'{col}_std_lag{window}'] = lag_std[col]
# add lag feature for 1 day, 3 days
add_lag(df, window=1)
add_lag(df, window=3)
print(df.head(10))
print(df.tail(10))
Just don't do reset_index(); then it works. With reset_index(), the rolled series gets a plain 0..N integer index, which doesn't align with the Date index of dfObj, so the assignment fills the new columns with NaN. Without it, the rolled series keeps the Date index and the assignment aligns row for row.
import numpy as np
import pandas as pd
# original data
df = pd.DataFrame()
np.random.seed(0)
days = pd.date_range(start='2015-01-01', end='2015-05-01', freq='1D')
df = pd.DataFrame({'Date': days, 'col1': np.random.randn(len(days)), 'col2': 20+np.random.randn(len(days)), 'col3': 50+np.random.randn(len(days))})
df = df.set_index('Date')
print(df.head(10))
def add_lag(dfObj, window):
    cols = ['col2', 'col3']
    for col in cols:
        rolled = dfObj[col].rolling(window)
        lag_mean = rolled.mean()  # .reset_index() removed
        lag_max = rolled.max()
        lag_min = rolled.min()
        lag_std = rolled.std()
        dfObj[f'{col}_mean_lag{window}'] = lag_mean  # no [col] needed on a Series
        dfObj[f'{col}_max_lag{window}'] = lag_max
        dfObj[f'{col}_min_lag{window}'] = lag_min
        dfObj[f'{col}_std_lag{window}'] = lag_std
# add lag feature for 1 day, 3 days
add_lag(df, window=1)
add_lag(df, window=3)
print(df.head(10))
print(df.tail(10))
Whenever you use the rolling function, it creates NaN for the values that it cannot calculate.
For example, consider a single column, col1 = [2, 4, 10, 6], and a rolling window of 2.
The output of the rolling mean will be NaN, 3, 7, 8.
This is because the rolling mean of the first value cannot be calculated: the window looks at the given index and the previous value, and there is no previous value.
Then, when you calculate the mean, std, etc., you are applying series functions without accounting for the NaN. In R you can usually just pass na.rm=T; in Python, it is usually recommended that you drop the NaN values before calculating the series function.
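As a side note, if the goal is fewer leading NaN rather than dropping them, rolling also accepts a min_periods argument. A quick sketch with the numbers above:

import pandas as pd

s = pd.Series([2, 4, 10, 6])
print(s.rolling(2).mean())                 # NaN, 3.0, 7.0, 8.0
print(s.rolling(2, min_periods=1).mean())  # 2.0, 3.0, 7.0, 8.0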
I have 2 data frames:
df1 contains the columns: time, bid_price
df2 contains the columns: time, flag
I want to plot the df1 time series as a line graph, and I want to put markers on that trace at the points in time where the df2 flag column value is True.
How can I do this?
You can do so in three steps:
set up a figure using go.Figure(),
add a trace for your bid_prices using fig.add_traces(go.Scatter()),
and do the same thing for your flags.
The snippet below does exactly what you're describing in your question. I've set up two dataframes df1 and df2, and then I've merged them together to make things a bit easier to reference later on.
I'm also showing flags for an accumulated series where each increment in the series > 0.9 is flagged in flags = [True if elem > 0.9 else False for elem in bid_price]. You should be able to easily adjust this to whatever your real-world dataset looks like.
Plot:
Complete code with random data:
# imports
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
import random
# settings
observations = 100
np.random.seed(5); cols = list('a')
bid_price = np.random.uniform(low=-1, high=1, size=observations).tolist()
flags = [True if elem > 0.9 else False for elem in bid_price]
time = [t for t in pd.date_range('2020', freq='D', periods=observations).format()]
# bid price
df1 = pd.DataFrame({'time': time,
                    'bid_price': bid_price})
df1.set_index('time', inplace=True)
df1.iloc[0] = 0  # start the series at zero; the cumulative sum is applied after the merge below
# flags
df2=pd.DataFrame({'time': time,
'flags':flags})
df2.set_index('time',inplace = True)
df = df1.merge(df2, left_index=True, right_index=True)
df.bid_price = df.bid_price.cumsum()
df['flagged'] = np.where(df['flags']==True, df['bid_price'], np.nan)
# plotly setup
fig = go.Figure()
# trace for bid_prices
fig.add_traces(go.Scatter(x=df.index, y=df['bid_price'], mode = 'lines',
name='bid_price'))
# trace for flags
fig.add_traces(go.Scatter(x=df.index, y=df['flagged'], mode = 'markers',
marker =dict(symbol='triangle-down', size = 16),
name='Flag'))
fig.update_layout(template = 'plotly_dark')
fig.show()
This question is probably simple, but I just can't figure out how to do it.
I have a dataframe grouped by a column. I want to plot each group, but only if its size is > 2.
Here is my code:
import matplotlib.pyplot as plt

df1 = df.groupby('Origin')
for key, group in df1:
    plt.figure()
    group.plot(x='xColumn', y='yColumn', title=str(key))
I have tried to filter out these groups using df2=df1.filter(lambda group: group.size() > 2) and set df2 in place of df1 in my code, but that gets me the error TypeError: 'numpy.int32' object is not callable.
Then I tried
df3 = df1.size()
if df3[df3 > 2]:
    # plot stuff
which raises the exception 'True and False columns missing'.
How can I build in the if condition to plot only groups with a size > 2?
You should be able to iterate through the dataset and decide if the groups have enough data or not:
import pandas as pd
import matplotlib.pyplot as plt

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
zipcode = [100, 100, 77, 77, 973]
weight = [100, 200, 300, 400, 500]
BabyDataSet = list(zip(names, zipcode, weight))  # list() so DataFrame can consume it
df = pd.DataFrame(data=BabyDataSet, columns=['Name', 'Zipcode', 'Weight'])
grouped = df.groupby(df.Zipcode)
for key, group in grouped:
    entries = group.size            # total number of cells in the group
    columns = len(group.columns)
    if entries / columns >= 2:      # cells divided by columns = row count
        plt.figure()
        group.plot(x='Zipcode', y='Weight', title=str(key))
There probably is a much nicer way still, though; see the filter-based sketch after the link below.
Example inspired by http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/01%20-%20Lesson.ipynb
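On the nicer-way note: the filter approach from the question works once group.size is not called. On a DataFrame group, .size is an integer attribute, not a method, which is what raised the TypeError. A sketch using len(group) instead:

# keep only groups with more than 2 rows, then plot each one
big_groups = df.groupby('Zipcode').filter(lambda g: len(g) > 2)
for key, group in big_groups.groupby('Zipcode'):
    plt.figure()
    group.plot(x='Zipcode', y='Weight', title=str(key))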