I'm using Python modules such as Pandas and Matplotlib to make charts for a university project.
I have a problem ordering the results in the pivot table.
This is the body of a function that takes 3 lists as input (['2017-03-03', ...], ['Username1', 'Username2', ...], [1012020, 103024, ...]), analyzes the data and makes a chart from it.
data = [date_list,username,field]
username_no_dup = list(set(username))
rows = zip(data[0], data[1], data[2])
headers = ['Date', 'Username', 'Value']
df = pd.DataFrame(rows, columns=headers)
df = df.sort_values('Value', ascending=False)
# sort_values works, but the order is lost when converting to a pivot table
pivot_df = pd.pivot_table(df ,index='Date', columns='Username', values='Value')
pivot_df.loc[:,username_no_dup].plot(kind='bar', stacked=True, color=color_list, figsize=(15,7))
I would like to order by value, with the greatest values nearest the x-axis of the chart. Has anyone solved this problem? Thank you.
Here are the top rows of df sorted by value:
[['2017-03-15','SSL1_APP',1515091]
['2017-03-16','SSL1_APP',1373827]
['2017-03-18','SSL1_APP',1136483]
['2017-03-21','SSL1_APP',601810]
['2017-03-17','SSL1_APP',325561]
['2017-03-15','KE77_APP',284971]
['2017-03-16','AF77_APP',222588]
['2017-03-16','MI77_APP',222148]
['2017-03-15','AF77_APP',202224]
['2017-03-15','MI77_APP',191791]
['2017-03-17','AF77_APP',187709]
['2017-03-16','PC77_APP',185766]
['2017-03-15','NE77_APP',177475]
['2017-03-18','FBW2_APP',175156]
['2017-03-16','NE77_APP',174570]
['2017-03-17','BFD1_APP',164238]
['2017-03-15','BFD1_APP',162931]
['2017-03-20','AF77_APP',152186]
['2017-03-17','PC77_APP',148727]
['2017-03-18','MI77_APP',147460]
['2017-03-16','BFD1_APP',145815]
['2017-03-20','BFD1_APP',145449]
['2017-03-15','PC77_APP',144959]
['2017-03-20','SSL1_APP',141719]]
The first pic is the plot I have created. The second one is the result I want, plotted with Excel:
Note: this is a plain-Python answer about sorting your input.
One way of doing this would be to use a two-dimensional list (a list of lists) and then sort it.
This is how you've been using it:
data = [date0,username0,randint0,date1,username1, ....
Try a two-dimensional list instead:
data = [[date0,username0,randint0], [date1,username1,randint1]...
Use the .sort() method and change the syntax to look like this:
data.sort()  # Sorts in place; ascending by default (pass reverse=True for descending)
rows = data  # each inner list is already one (date, username, value) row
The standard .sort() method has its limitations (floats, for one), so if it doesn't return the desired output, try using the .sort() parameters. Here is some insight on the subject: How to use .sort()
If you are having trouble with floats, check an answer that will help you here.
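On the pandas side, pivot_table sorts the columns by label, so the earlier sort_values on the rows cannot survive the pivot. A minimal sketch of one workaround, assuming it is acceptable to order the users by their total over all dates (the sample rows are a small subset of the question's data, and column_order and ordered are illustrative names):

```python
import pandas as pd

# Small sample in the same shape as the question's data.
df = pd.DataFrame(
    [
        ("2017-03-15", "SSL1_APP", 1515091),
        ("2017-03-15", "KE77_APP", 284971),
        ("2017-03-15", "AF77_APP", 202224),
        ("2017-03-16", "SSL1_APP", 1373827),
        ("2017-03-16", "AF77_APP", 222588),
    ],
    columns=["Date", "Username", "Value"],
)

pivot_df = pd.pivot_table(df, index="Date", columns="Username", values="Value")

# Reorder the pivot's columns by their total, largest first.
column_order = pivot_df.sum().sort_values(ascending=False).index
ordered = pivot_df[column_order]

print(list(ordered.columns))
# ordered.plot(kind="bar", stacked=True) stacks the first column at the bottom.
```

This gives one global order for every bar. If each date needs its own ordering, the bars would have to be drawn one segment at a time, which this sketch does not attempt.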
I need some guidance in working out how to plot a block of histograms from grouped data in a pandas dataframe. Here's an example to illustrate my question:
from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')
In my ignorance I tried this code command:
df.groupby('Letter').hist()
which failed with the error message "TypeError: cannot concatenate 'str' and 'float' objects"
Any help most appreciated.
I'm on a roll, just found an even simpler way to do it using the by keyword in the hist method:
df['N'].hist(by=df['Letter'])
That's a very handy little shortcut for quickly scanning your grouped data!
For future visitors, the product of this call is the following chart:
One solution is to use the matplotlib histogram directly on each group. Iterating over the groupby object yields (name, sub-frame) pairs, and each sub-frame is a dataframe you can draw a histogram from.
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')
# One figure per group
for name, group in grouped:
    plt.figure()
    plt.hist(group.N)
plt.show()
Your function is failing because the groupby dataframe you end up with has a hierarchical index and two columns (Letter and N), so when you call .hist() it tries to make a histogram of both columns, hence the str error.
This is the default behavior of pandas plotting functions (one plot per column), so if you reshape your data frame so that each letter is a column you will get exactly what you want.
df.reset_index().pivot(index='index', columns='Letter', values='N').hist()
The reset_index() is just to shove the current index into a column called index. Then pivot will take your data frame, collect all of the values N for each Letter and make them a column. The resulting data frame has 1000 rows (one per original row, with NaN wherever the row belongs to a different letter) and three columns (A, B, C). hist() will then produce one histogram per column, and you can format the plots as needed.
With a recent version of pandas, you can do
df.N.hist(by=df.Letter)
Just like with the solutions above, the axes will be different for each subplot. I have not solved that one yet.
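One possible way to get matching axes, sketched here on the question's sample data, is DataFrame.hist with its sharex and sharey parameters. The Agg backend line is only there so the sketch runs without a display and can be dropped in an interactive session:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import numpy as np
import pandas as pd

x = ["A"] * 300 + ["B"] * 400 + ["C"] * 300
y = np.random.randn(1000)
df = pd.DataFrame({"Letter": x, "N": y})

# One subplot per letter, all sharing the same x and y axis ranges.
axes = df.hist(column="N", by="Letter", sharex=True, sharey=True)
```

The axis ranges are shared, but the bin edges are still computed per group, so bar widths can differ slightly between subplots.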
I'm writing this answer because I was looking for a way to plot the histograms of different groups together. What follows is not very smart, but it works fine for me. I use NumPy to compute the histogram and Bokeh for plotting. I think it is self-explanatory, but feel free to ask for clarifications and I'll be happy to add details (and write it better).
import numpy as np
import pandas as pd
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show
figures = {
    'Transit': figure(title='Transit', x_axis_label='speed [km/h]', y_axis_label='frequency'),
    'Driving': figure(title='Driving', x_axis_label='speed [km/h]', y_axis_label='frequency')
}
cols = {'Vienna': 'red', 'Turin': 'blue', 'Rome': 'Orange'}
# Each item is a ((locality, means), sub-frame) pair
for gr in df_trips.groupby(['locality', 'means']):
    locality = gr[0][0]
    means = gr[0][1]
    fig = figures[means]
    h, b = np.histogram(pd.DataFrame(gr[1]).speed.values)
    fig.vbar(x=b[1:], top=h, width=(b[1]-b[0]), legend_label=locality, fill_color=cols[locality], alpha=0.5)
show(gridplot([
    [figures['Transit']],
    [figures['Driving']],
]))
I find this even easier and faster.
data_df.groupby('Letter').count()['N'].hist(bins=100)
I wrote the script below, and I'm 98% content with the output. However, the disorder of the 'Approved' field bugs me. As you can see, I tried to sort the values using .sort_values() but was unsuccessful. The output of the script is below, as is the list of fields in the data frame.
df = df.replace({'Citizen': {1: 'Yes', 0: 'No'}})
grp_by_citizen = pd.DataFrame(df.groupby(['Citizen']).agg({i: 'value_counts' for i in ['Approved']}).fillna(0))
grp_by_citizen.rename(columns = {'Approved': 'Count'}, inplace = True)
grp_by_citizen.sort_values(by = 'Approved')
grp_by_citizen
Do let me know if you need further clarification or insight as to my objective.
You need to reassign the result of sort_values or use inplace=True. From the documentation:
Returns: DataFrame or None
DataFrame with sorted values or None if inplace=True.
grp_by_citizen = grp_by_citizen.sort_values(by = 'Approved')
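To illustrate the difference, here is a minimal sketch on a made-up three-row frame (the values are invented; only the column name comes from the question):

```python
import pandas as pd

df = pd.DataFrame({"Approved": [3, 1, 2]})

# The sorted copy is discarded, so df keeps its original order.
df.sort_values(by="Approved")
unchanged = df["Approved"].tolist()

# Reassigning keeps the sorted result.
df = df.sort_values(by="Approved")
reassigned = df["Approved"].tolist()

print(unchanged, reassigned)
```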
First go with:
f = A.columns.values.tolist()
to see what the actual names of your columns are. Then you can try:
A.sort_values(by=f[:2])
And if you sort by column name, keep in mind that 2L is a Python 2 long int (in Python 3, write plain 2), so just go:
A.sort_values(by=[2L])
How can I label my x-axis with multiple columns? Here's an example that works:
df = pd.DataFrame({"player_name": ["Alan","Bob","Carl","Dan","Earl"],
"jersey_number": ['1','2','3','4','5'],
"hits" : [2,3,1,2,4],
"at_bats" : [7,6,8,7,8]
})
df["label"] = df["player_name"]+"-"+df["jersey_number"]
df.plot(x="label", y=["hits", "at_bats"])
plt.show()
But this has a couple of weaknesses. First, the example line to create the label column is tedious. Second, string concat is finicky. If the jersey_numbers aren't strings (e.g. ints instead), the concat fails. I can write a subroutine to take a list of columns, cast all as strings, and concat them. That seems like it should be unnecessary though; there should be some built-in way to do this, something like:
df = pd.DataFrame({"player_name": ["Alan","Bob","Carl","Dan","Earl"],
"jersey_number": ['1','2','3','4','5'],
"hits" : [2,3,1,2,4],
"at_bats" : [7,6,8,7,8]
})
df.plot(x=["player_name","jersey_number"], y=["hits", "at_bats"])
plt.show()
This doesn't work; it throws ValueError: x must be a label or position.
My googlefu hasn't been strong enough to discover the correct syntax. Does it exist, and if yes what is it? Thanks
One option is to set those columns as the index and then plot:
df.set_index(["player_name","jersey_number"]).plot( y=["hits", "at_bats"])
which gives
Although I would prefer your first approach, since it gives a better representation:
df["label"] = df[["player_name","jersey_number"]].astype(str).agg('-'.join, axis=1)
or
df['label'] = [f'{x}-{y}' for x,y in zip(df["player_name"],df["jersey_number"]) ]
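As a quick check that the cast handles non-string columns, here is a small sketch with integer jersey numbers (a shortened, hypothetical version of the question's frame):

```python
import pandas as pd

df = pd.DataFrame({
    "player_name": ["Alan", "Bob", "Carl"],
    "jersey_number": [1, 2, 3],  # ints, where plain string concat would fail
})

# Cast every label column to str, then join the values of each row.
df["label"] = df[["player_name", "jersey_number"]].astype(str).agg("-".join, axis=1)

print(df["label"].tolist())
```

The axis=1 argument matters: without it, the join runs down each column instead of across each row.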
I'm trying to apply a function to a pandas dataframe. The function requires two np.arrays as input and fits them using a well-defined model.
The problem is that I can't apply this function to the selected columns, since their "rows" contain lists read from a JSON file and not np.arrays.
Now, I've tried different solutions:
#Here is where I discover the problem
train_df['result'] = train_df.apply(my_function(train_df['col1'],train_df['col2']))
# so I've tried to cast the Series before passing them to the function in both these ways:
X_col1_casted = train_df['col1'].dtype(np.array)
X_col2_casted = train_df['col2'].dtype(np.array)
doesn't work.
X_col1_casted = train_df['col1'].astype(np.array)
X_col2_casted = train_df['col2'].astype(np.array)
doesn't work.
X_col1_casted = train_df['col1'].dtype(np.array)
X_col2_casted = train_df['col2'].dtype(np.array)
doesn't work.
What I'm thinking of doing now is a long procedure like this:
starting from the uncasted column Series, convert them to list(), iterate over them, apply the function to the individual np.array() elements, and append the results to a temporary list. Once done, I will convert this list into a new column. (Clearly, I don't know if it will work.)
Does anyone know how to help me?
EDIT:
I'm adding an example to be clear:
The function expects two np.arrays as input. Right now it gets two lists, since they are read from a JSON file. The situation is this:
col1 col2 result
[1,2,3] [4,5,6] [5,7,9]
[0,0,0] [1,2,3] [1,2,3]
Clearly my function is not a sum but a custom function. For the moment, assume that this sum only works on arrays and not on lists; what should I do?
Thanks in advance
Use apply to convert each element to its equivalent array:
df['col1'] = df['col1'].apply(lambda x: np.array(x))
type(df['col1'].iloc[0])
numpy.ndarray
Data:
df = pd.DataFrame({'col1': [[1,2,3],[0,0,0]]})
df
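To then feed two such columns to a function row by row, one option is to convert each pair of lists as it is consumed. A minimal sketch, where my_function is a hypothetical stand-in (an element-wise sum, matching the question's example) for the real model-fitting function:

```python
import numpy as np
import pandas as pd

def my_function(a, b):
    # Hypothetical stand-in for the real function; it expects two np.arrays.
    return a + b

train_df = pd.DataFrame({
    "col1": [[1, 2, 3], [0, 0, 0]],
    "col2": [[4, 5, 6], [1, 2, 3]],
})

# Convert each row's lists to arrays just before calling the function.
train_df["result"] = [
    my_function(np.asarray(a), np.asarray(b))
    for a, b in zip(train_df["col1"], train_df["col2"])
]

print(train_df["result"].tolist())
```

A plain list comprehension is used here instead of apply(axis=1) to keep each result as a single array in one cell.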