I have a dataframe that has 3 columns. I built it from a bigger dataframe like this:
new_df = df[['client_name', 'time_window_end', 'tag_count']]
Then I used groupby to find the number of tags for each client on each day using this code:
new_df.groupby(['client_name', 'time_window_end'])['tag_count'].count()
In total I have 70 client names in a list, and I want to loop through the list to draw a line plot for each customer name. On the x axis I want 'time_window_end' and on the y axis 'tag_count'.
I want 70 plots, but the for loop I have written does not do that. I would be happy if you could help me fix it.
clients = new_df['client_name'].unique()
client_list = clients.tolist()
for client in client_list[:60]:
    temp = new_df.loc[new_df['client_name'] == client]
    x = temp.groupby(temp['time_window_end'].dt.floor('d'))['tag_count'].sum()
    df2 = x.to_frame()
    df2["time_window_end"]= pd.to_datetime(df2["time_window_end"])
    line_chart = df2.copy()
    plt.plot(line_chart.reset_index()["time_window_end"], x)
If I'm understanding this right, it sounds like the seaborn package might have what you need. The plotting functions take the argument 'hue' which splits plots up into multiple lines, based on the data in a column
import seaborn as sn
new_df = new_df.groupby(['client_name', 'time_window_end'])['tag_count'].count().reset_index()
# kind = 'line' is a relplot argument, so the call is assumed to be sn.relplot
sn.relplot(
    data = new_df,
    x = pd.to_datetime(new_df["time_window_end"]),
    y = 'tag_count',
    hue = 'client_name',
    kind = 'line')
EDIT: to get multiple plots
import seaborn as sn
new_df["time_window_end"] = pd.to_datetime(new_df["time_window_end"])
g = sn.FacetGrid(
    data = new_df,
    row = 'client_name')
g.map(sn.lineplot, 'time_window_end', 'tag_count')
EDIT again: to get separate plot images
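import matplotlib.pyplot as plt
for name in pd.unique(new_df.client_name):
    plt.figure()  # start a new figure for each client
    # the plotting call is missing here; sn.lineplot is assumed
    sn.lineplot(
        data = new_df.loc[new_df.client_name == name],
        x = 'time_window_end',
        y = 'tag_count',
        label = name)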
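For reference, the one-figure-per-client loop can also be written with plain matplotlib. This is only a sketch: the tiny dataframe and the helper name `plot_per_client` are made up for illustration, and it assumes `time_window_end` is already a datetime column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to see windows
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data standing in for new_df
new_df = pd.DataFrame({
    "client_name": ["a", "a", "a", "b", "b"],
    "time_window_end": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 15:00", "2021-01-02 09:00",
        "2021-01-01 11:00", "2021-01-03 12:00",
    ]),
    "tag_count": [1, 2, 3, 4, 5],
})


def plot_per_client(df):
    """Return {client_name: Figure}, one daily line plot per client."""
    figures = {}
    for client, group in df.groupby("client_name"):
        # Sum the tag counts per calendar day for this client
        days = group["time_window_end"].dt.floor("D")
        daily = group.groupby(days)["tag_count"].sum()
        fig, ax = plt.subplots()  # a new figure on every iteration
        ax.plot(daily.index, daily.values, marker="o")
        ax.set_title(str(client))
        ax.set_xlabel("time_window_end")
        ax.set_ylabel("tag_count")
        fig.autofmt_xdate()
        figures[client] = fig
    return figures


figures = plot_per_client(new_df)
```

Creating the figure inside the loop is what keeps the lines from piling up on a single set of axes; each returned figure can then be shown or saved with `fig.savefig(...)`.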
I would like to colour-code a scatter plot based on two dataframe columns: each distinct value of df[1] should get its own colour, and among the points that share a df[1] value the opacity should vary with df[2], so that the highest df[2] value in the group is 100 % opaque and the lowest is the least opaque.
Here is the code:
def func():
df = pd.read_csv(PATH + file, sep=",", header=None)
b = 2.72
a = 0.00000009
popt, pcov = curve_fit(func, df[2], df[5]/df[4], p0=[a,b])
perr = np.sqrt(np.diag(pcov))
plt.scatter(df[1], df[5]/df[4]/df[2])
# Plot responsible for the datapoints in the figure
plt.plot(df[1], func_cpu(df[2], *popt)/df[2], "r")
# plot responsible for the curve in the figure
plt.legend(loc="upper left")
Here is the sample dataset:
So, the x-axis would be df[1] (31, 31, 31, 31, 34, 34, ...) and the y-axis is df[5], df[4], df[2] (9, 10, 413). Each distinct value of df[1] needs a new colour; it would be fine to repeat the colour cycle after, say, 6 unique colours. Within each colour the opacity needs to change with the value of df[2] (though the y-axis is df[5], df[4], df[2]): the highest value gets the darkest version of the colour and the lowest gets the lightest.
and the scatter plot:
This is roughly how my desired solution of the color code needs to look like:
I have around 200 entries in the csv file.
Would using NumPy be more advantageous in this scenario?
Let me know if this is appropriate or if I have misunderstood anything:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# not needed for you
# df = pd.read_csv('~/Documents/tmp.csv')
max_2 = pd.DataFrame(df.groupby('1').max()['2'])
no_unique_colors = 3
color_set = [np.random.random((3)) for _ in range(no_unique_colors)]
# assign colors to unique df2 in cyclic order
max_2['colors'] = [color_set[unique_df2 % no_unique_colors] for unique_df2 in range(max_2.shape[0])]
# calculate the opacities for each entry in the dataframe
colors = [list(max_2.loc[df1].colors) + [float(df['2'].iloc[i])/max_2['2'].loc[df1]] for i, df1 in enumerate(df['1'])]
# repeat thrice so that df2, df4 and df5 share the same opacity
colors = [x for x in colors for _ in range(3)]
plt.scatter(df['1'].values.repeat(3), df[['2', '4', '5']].values.reshape(-1), c=colors)
Well, what do you know. I understood this task totally differently. I thought the point was to have alpha levels according to all df[2], df[4], and df[5] values for each df[1] value. Oh well, since I have done the work already, why not post it?
from matplotlib import pyplot as plt
import pandas as pd
from itertools import cycle
from matplotlib.colors import to_rgb
#read the data, column numbers will be generated automatically
df = pd.read_csv("data.txt", sep = ",", header=None)
#our figure with the ax object
fig, ax = plt.subplots(figsize=(10,10))
#definition of the colors
sc_color = cycle(["tab:orange", "red", "blue", "black"])
#get groups of the same df[1] value, they will also be sorted at the same time
dfgroups = df.iloc[:, [2, 4, 5]].groupby(by=df[1])
#plot each group with a different colour
for groupkey, groupval in dfgroups:
    #create group dataframe with df[1] value as x and df[2], df[4], and df[5] values as y
    groupval= groupval.melt(var_name="x", value_name="y")
    groupval.x = groupkey
    #get min and max y for the normalization
    y_high = groupval.y.max()
    y_low = groupval.y.min()
    #read out r, g, and b values of the next color in the cycle
    r, g, b = to_rgb(next(sc_color))
    #create a colour array with nonlinear normalized alpha levels
    #between 0.19 and 0.99, so that all data points are visible
    #(the group maximum gets the lowest alpha, the group minimum the highest)
    group_color = [(r, g, b, 0.19 + 0.8 * ((y_high-val) / (y_high-y_low))**7) for val in groupval.y]
    #and plot
    ax.scatter(groupval.x, groupval.y, c=group_color)
Sample output of your data:
Two main problems here. One is that alpha in a scatter plot does not accept an array. But color does, hence, the detour to read out the RGB values and create an RGBA array with added alpha levels.
The other is that your data are spread over a rather wide range. A linear normalization makes changes near the lowest values invisible. There is surely some optimization possible; I like for instance this suggestion.
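The behaviour the question literally asks for (one colour per df[1] value, with the largest df[2] in each group fully opaque) could be sketched as follows. The sample numbers, the helper name `group_rgba`, the palette, and the minimum alpha of 0.2 are all assumptions for illustration, and the scaling here is linear rather than the nonlinear one used above.

```python
from itertools import cycle

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to see a window
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import to_rgb


def group_rgba(keys, values, palette, min_alpha=0.2):
    """One RGBA tuple per row: colour from the key, alpha from the value."""
    keys = pd.Series(keys).reset_index(drop=True)
    values = pd.Series(values, dtype=float).reset_index(drop=True)
    # Assign palette colours to the sorted unique keys, cycling when exhausted
    color_of = dict(zip(sorted(keys.unique()), cycle(palette)))
    low = values.groupby(keys).transform("min")
    high = values.groupby(keys).transform("max")
    span = (high - low).where(high > low, 1.0)  # avoid 0/0 for constant groups
    # Linear scale: group minimum -> min_alpha, group maximum -> 1.0
    alpha = min_alpha + (1.0 - min_alpha) * (values - low) / span
    alpha = alpha.where(high > low, 1.0)  # constant groups are fully opaque
    return [(*to_rgb(color_of[k]), a) for k, a in zip(keys, alpha)]


# Made-up rows: column 1 is the group, column 2 drives the opacity
df = pd.DataFrame({1: [31, 31, 31, 34, 34], 2: [100, 250, 400, 50, 150]})
colors = group_rgba(df[1], df[2], ["tab:blue", "tab:orange", "tab:green"])

fig, ax = plt.subplots()
ax.scatter(df[1], df[2], c=colors)
```

Passing a list of RGBA tuples through `c` sets the opacity per point without relying on the `alpha` argument.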
I need a plot of aggregated data.
import pandas as pd
basic_data= pd.read_csv('WHO-COVID-19-global-data _2.csv',parse_dates= ['Date_reported'] )
cum_daily_cases = basic_data.groupby('Date_reported')[['New_cases']].sum()
import pylab
x = cum_daily_cases['Date_reported']
y = cum_daily_cases['New_cases']
Error: 'Date_reported'
Input:
Date_reported, Country_code, Country, WHO_region, New_cases, Cumulative_cases, New_deaths, Cumulative_deaths
2020-01-03,AF,Afghanistan,EMRO,0,0,0,0
Output: the total number of "New cases" per day, shown on the plot.
What should I do to run this plot? link to dataset
The column names contain a leading space (can be easily seen by checking basic_data.dtypes). Fix that by adding the following line immediately after basic_data was read:
basic_data.columns = [s.strip() for s in basic_data.columns]
In addition, your x variable should be the index after groupby-sum, not a column Date_reported. Correction:
x = cum_daily_cases.index
The plot should show as expected.
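Putting both fixes together, a minimal sketch might look like this. The inline CSV text is a made-up stand-in for the WHO file (only the header layout, with a space after each comma, is taken from the question), and `matplotlib.pyplot` is used here instead of `pylab`.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to see a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; note the space after each comma in the header line
csv_text = """Date_reported, Country_code, Country, WHO_region, New_cases
2020-01-03,AF,Afghanistan,EMRO,1
2020-01-03,AL,Albania,EURO,2
2020-01-04,AF,Afghanistan,EMRO,3
"""

basic_data = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date_reported"])
# Remove the leading spaces from the column names
basic_data.columns = [s.strip() for s in basic_data.columns]

# After groupby-sum the dates live in the index, not in a column
daily_cases = basic_data.groupby("Date_reported")[["New_cases"]].sum()

fig, ax = plt.subplots()
ax.plot(daily_cases.index, daily_cases["New_cases"])
ax.set_xlabel("Date_reported")
ax.set_ylabel("New_cases")
```

If a regular column is preferred over the index, `daily_cases.reset_index()` turns `Date_reported` back into a column.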
I am learning to use matplotlib with pandas and I am having a little trouble with it. I have a dataframe with districts as its row labels and coffee shops as its column labels. The values are the start dates of the coffee shops in the respective districts.
        starbucks    cafe-cool    barista     ........ 60 shops
dist1   2008-09-18   2010-05-04   2007-02-21  ...............
dist2   2007-06-12   2011-02-17
100 districts
I want to plot a scatter plot with x axis as time series and y axis as coffee-shops. Since I couldn't figure out a direct one line way to plot this, I extracted the coffee-shops as one list and dates as other list.
shops = list(df.columns.values)
dt = pd.DataFrame(df.ix['dist1'])
dates = dt.set_index('dist1')
First I tried plt.plot(dates, shops). Got a ZeroDivisionError: integer division or modulo by zero - error. I could not figure out the reason for it. I saw on some posts that the data should be numeric, so I used ytick function.
y = [1, 2, 3, 4, 5, 6,...60]
Still, plt.plot(dates, y) threw the same ZeroDivisionError. If I could get past this, maybe I would be able to plot using the tick function.
I am trying to plot the graph for only first row/dist1. For that I fetched the first row as a dataframe df1 = df.ix[1] and then used the following
for badges, dates in df.iteritems():
    date = dates
    ax.plot_date(date, yval)
    # Record the number and label of the coffee shop
I got an error at the line ax.plot_date(date, yval) saying x and y should have the same first dimension. Since I am plotting one by one for each coffee shop for dist1, shouldn't the length always be one for both x and y? PS: date is a datetime.date object.
To achieve this you need to convert the dates to datetimes, see here for
an example. As mentioned you also need to convert the coffee shops into
some numbering system then change the tick labels accordingly.
Here is an attempt
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
from datetime import datetime
def get_datetime(string):
    "Converts string '2008-05-04' to datetime"
    return datetime.strptime(string, "%Y-%m-%d")
# Generate dataframe
df = pd.DataFrame(dict(
    starbucks=["2008-09-18", "2007-06-12"],
    cafe_cool=["2010-05-04", "2011-02-17"]),
    index=["dist1", "dist2"])
ax = plt.subplot(111)
label_list = []
label_ticks = []
yval = 1 # numbering system
# Iterate through coffee shops (items() replaces iteritems(), removed in pandas 2.0)
for coffee_shop, dates in df.items():
    # Convert strings into datetime list
    datetimes = [get_datetime(date) for date in dates]
    # Create list of yvals [yval, yval, ...] to plot against
    yval_list = np.zeros(len(dates))+yval
    # Markers only (plot_date is deprecated since Matplotlib 3.9)
    ax.plot(datetimes, yval_list, "o")
    # Record the number and label of the coffee shop
    label_ticks.append(yval)
    label_list.append(coffee_shop)
    yval+=1 # Change the number so they don't all sit at the same y position
# Now set the yticks appropriately
ax.set_yticks(label_ticks)
ax.set_yticklabels(label_list)
# Set the limits so we can see everything
ax.set_ylim(0, yval)
plt.show()
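An alternative that avoids the explicit loop, offered only as a sketch: `pd.to_datetime` handles the string conversion and `DataFrame.melt` reshapes the table into one row per (district, shop) pair. The two-shop dataframe is toy data in the shape described in the question, and the column names `district`, `shop`, `start_date`, and `y` are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to see a window
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: rows are districts, columns are coffee shops, values are start dates
df = pd.DataFrame(
    {
        "starbucks": ["2008-09-18", "2007-06-12"],
        "cafe_cool": ["2010-05-04", "2011-02-17"],
    },
    index=["dist1", "dist2"],
)

# Long form: one row per (district, shop) pair, missing dates dropped
long = (
    df.rename_axis("district")
    .reset_index()
    .melt(id_vars="district", var_name="shop", value_name="start_date")
    .dropna(subset=["start_date"])
)
long["start_date"] = pd.to_datetime(long["start_date"])

# Map each shop name to an integer y position
shops = list(df.columns)
long["y"] = long["shop"].map({name: i + 1 for i, name in enumerate(shops)})

fig, ax = plt.subplots()
ax.scatter(long["start_date"], long["y"])
ax.set_yticks(range(1, len(shops) + 1))
ax.set_yticklabels(shops)
ax.set_ylim(0, len(shops) + 1)
```

Because x and y come from the same long-form table, their lengths always match, which sidesteps the "same first dimension" error mentioned in the question.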