I have a data set sort of like this:
fruits = ["orange", "plum", "lime"]
data = [(random.choice(fruits),
random.randint(0,100),
random.randint(0,100)) for i in range(16)]
dframe = pd.DataFrame(data, columns=["fruit", "x", "y"])
where fruit has only a few values. I want a select widget so you can pick which kind of fruit you want to see in the plot.
Here's the update function I currently have:
source = bk.ColumnDataSource(dframe)
by_fruit = dframe.groupby('fruit')

def update(fruit):
    grouped = by_fruit.get_group(fruit)
    source.data['x'] = grouped['x']
    source.data['y'] = grouped['y']
    source.data['fruit'] = grouped['fruit']
    source.push_notebook()

interact(update, fruit=fruits)
but going through and re-assigning the values of each column seems excessively verbose as I add more columns. It's also error-prone: if I leave out a column, the columns end up with different lengths and get misaligned.
Pandas excels at slicing and dicing things, and I feel like I'm missing something. What's a more concise way to change the Series in each column of the ColumnDataSource at the same time?
[This example in an IPython Notebook]
You could iterate over the columns of grouped:
def update(fruit):
    grouped = by_fruit.get_group(fruit)
    for col in grouped:
        source.data[col] = grouped[col]
    source.push_notebook()
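If you want to swap every column in one step, you can also assign a whole new dict to source.data; Bokeh's ColumnDataSource.from_df builds such a dict from a DataFrame. A minimal sketch, assuming a Bokeh version where push_notebook lives in bokeh.io (older releases exposed it as a method on the source, as in the question):

from bokeh.io import push_notebook
from bokeh.models import ColumnDataSource

def update(fruit):
    grouped = by_fruit.get_group(fruit)
    # Replace all columns at once, so lengths can never get out of sync
    source.data = ColumnDataSource.from_df(grouped)
    push_notebook()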
I have a dataframe that looks like this:
df = pd.DataFrame(data=list(range(0, 10)),
                  index=pd.MultiIndex.from_product(
                      [[str(list(range(0, 1000)))], list(range(0, 10))],
                      names=["ind1", "ind2"]),
                  columns=["col1"])
df['col2'] = str(list(range(0, 1000)))
Unfortunately, the display of the above dataframe looks like this:
If I try to set pd.options.display.max_colwidth = 5, then col2 behaves and is displayed in a single row, but ind1 doesn't behave:
Since ind1 is part of a MultiIndex, I don't mind that it occupies multiple rows, but I would like to limit its width. If I could also force each row to occupy at most the height of a single line, that would be great. I don't mind individual cells being truncated on display, because I prefer to scroll less, in any direction, to see a cell.
I am aware I can create my own HTML display. That's great and all, but I think it's too complex for my use case of just wanting smaller width columns for data analysis in jupyter notebooks. Nevertheless, such a solution might help other similar use cases, if you are inclined to write one.
What I'm looking for is some setting (which I thought was pd.options.display.max_colwidth) that limits the column width even when the column is an index. Something that disables wrapping for long texts would probably solve the same issue as well.
I also tried to print without the index via df.style.hide_index(), in combination with pd.options.display.max_colwidth = 5, but then col2 stops behaving:
At this point I've run out of ideas. Any suggestions?
Here is one way to do it:
import pandas as pd
df = pd.DataFrame(
    data=list(range(0, 10)),
    index=pd.MultiIndex.from_product(
        [[str(list(range(0, 1000)))], list(range(0, 10))], names=["ind1", "ind2"]
    ),
    columns=["col1"],
)
df["col2"] = str(list(range(0, 1000)))
In the next Jupyter cell, run:
df.style.set_properties(**{"width": "10"}).set_table_styles(
    [{"selector": "th", "props": [("vertical-align", "top")]}]
)
Which outputs:
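If you also want to cap the width of both the index headers and the data cells, a variation on the same Styler approach may work: constrain the cell width and disable wrapping. This is only a sketch; the 80px value and the table-layout attribute are assumptions to tune, and how strictly browsers honor widths on table cells can vary:

df.style.set_table_attributes('style="table-layout: fixed"').set_properties(
    **{"max-width": "80px", "overflow": "hidden", "white-space": "nowrap"}
).set_table_styles(
    [{"selector": "th", "props": [("max-width", "80px"),
                                  ("overflow", "hidden"),
                                  ("white-space", "nowrap"),
                                  ("vertical-align", "top")]}]
)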
I'm trying to create a new column 'BroadCategory' in a dataframe, based on whether the values of another column, 'VenueCategory', occur in specific lists. I have 5 lists that I am using to fill in the values of the new column.
For example:
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Bar),'Bar','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Museum_ArtGallery),'Museum/Art Gallery','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Public_Transport),'Public Transport','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Restaurant_FoodVenue),'Restaurant/Food Venue','Other')
I ultimately want the values in VenueCategory column occurring in the list Bar to be labeled 'Bar' and those occurring in the list Museum_ArtGallery to be labeled 'Museum_ArtGallery', etc. My code above doesn't accomplish this.
I tried this in order to keep the values I had previously filled in, but it still overwrites the values set by my earlier conditions:
df['BroadCategory'] = np.where(df[df.VenueCategory!='Other'].isin(Entertainment_Venue),'Entertainment Venue','Other')
How can I fill the column BroadCategory with the specific values based on whether the values in the VenueCategory column occur in the specified lists Bar, Restaurant, Public_Transport, Museum_ArtGallery, etc.?
Suppose your data is like this:
df=pd.DataFrame({'VenueCategory':['drink','wine','MOMA','MTA','sushi','Hudson']})
Bar=['drink','wine','alcohol']
Museum_ArtGallery=['MOMA','MCM']
Public_Transport=['MTA','MBTA']
Restaurant_FoodVenue=['sushi','chicken']
prepare a dictionary:
from collections import defaultdict
d=defaultdict(lambda:'other')
d.update({x:'Bar' for x in Bar})
d.update({x:'Museum_ArtGallery' for x in Museum_ArtGallery})
d.update({x:'Public_Transport' for x in Public_Transport})
d.update({x:'Restaurant_FoodVenue' for x in Restaurant_FoodVenue})
build new column and print result:
df['BroadCategory']=df['VenueCategory'].apply(lambda x:d[x])
df
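Since d is a dict subclass that defines __missing__ (a defaultdict), Series.map will fall back to its default for unseen venues, so the apply can also be written as a vectorized map:

df['BroadCategory'] = df['VenueCategory'].map(d)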
Another option is to build a lookup table and merge it onto the dataframe:
venue_list = [['Bar', Bar],
              ['Museum_ArtGallery', Museum_ArtGallery]
              # etc
             ]

venue_lookup = pd.concat([
    pd.DataFrame({
        'BroadCategory': venue[0],
        'VenueCategory': venue[1]}) for venue in venue_list]
)

pd.merge(df, venue_lookup, how='left', on='VenueCategory')
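One thing to watch with the left merge: venues that don't appear in any list end up with NaN in BroadCategory rather than 'Other'. A quick fix, assuming 'Other' is the desired default:

merged = pd.merge(df, venue_lookup, how='left', on='VenueCategory')
merged['BroadCategory'] = merged['BroadCategory'].fillna('Other')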
Your solution is already close. In order not to overwrite previously assigned values, you should select a subset of the rows and only set new values on that subset.
To do that, first initialize the new column BroadCategory to 'Other'. Then select the rows of each category with a Boolean mask built with .isin(), like you are doing now, and assign through .loc (chained indexing like df['BroadCategory'][mask] = ... can raise SettingWithCopyWarning and may not write back to the frame). The code looks like this:
df['BroadCategory'] = 'Other'
df.loc[df['VenueCategory'].isin(Bar), 'BroadCategory'] = 'Bar'
df.loc[df['VenueCategory'].isin(Museum_ArtGallery), 'BroadCategory'] = 'Museum/Art Gallery'
df.loc[df['VenueCategory'].isin(Public_Transport), 'BroadCategory'] = 'Public Transport'
df.loc[df['VenueCategory'].isin(Restaurant_FoodVenue), 'BroadCategory'] = 'Restaurant/Food Venue'
df.loc[df['VenueCategory'].isin(Entertainment_Venue), 'BroadCategory'] = 'Entertainment Venue'
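Since the question started from np.where, another compact option is np.select, which evaluates all the conditions at once and falls back to a default. A minimal sketch, assuming the same category lists as above:

import numpy as np

conditions = [
    df['VenueCategory'].isin(Bar),
    df['VenueCategory'].isin(Museum_ArtGallery),
    df['VenueCategory'].isin(Public_Transport),
    df['VenueCategory'].isin(Restaurant_FoodVenue),
]
choices = ['Bar', 'Museum/Art Gallery', 'Public Transport', 'Restaurant/Food Venue']

# The first matching condition wins; anything unmatched becomes 'Other'
df['BroadCategory'] = np.select(conditions, choices, default='Other')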
I'm using Python modules like Pandas and Matplotlib to make charts for a university project.
I'm having some problems ordering the results in the pivot table.
This is the body of a function that takes 3 lists as input ([2017-03-03, ...], ['Username1', 'Username2', ...], [1012020, 103024, ...]), analyzes the data and makes a chart from it.
data = [date_list, username, field]
username_no_dup = list(set(username))
rows = zip(data[0], data[1], data[2])
headers = ['Date', 'Username', 'Value']
df = pd.DataFrame(rows, columns=headers)
df = df.sort_values('Value', ascending=False)
# sort_values works, but the sorting is not preserved when converting to a pivot table
pivot_df = pd.pivot_table(df, index='Date', columns='Username', values='Value')
pivot_df.loc[:, username_no_dup].plot(kind='bar', stacked=True, color=color_list, figsize=(15, 7))
I would like to order by value, with the greatest values nearest the X-axis of the chart. Has anyone solved this problem? Thank you.
Here are the top rows of df sorted by value:
[['2017-03-15','SSL1_APP',1515091]
['2017-03-16','SSL1_APP',1373827]
['2017-03-18','SSL1_APP',1136483]
['2017-03-21','SSL1_APP',601810]
['2017-03-17','SSL1_APP',325561]
['2017-03-15','KE77_APP',284971]
['2017-03-16','AF77_APP',222588]
['2017-03-16','MI77_APP',222148]
['2017-03-15','AF77_APP',202224]
['2017-03-15','MI77_APP',191791]
['2017-03-17','AF77_APP',187709]
['2017-03-16','PC77_APP',185766]
['2017-03-15','NE77_APP',177475]
['2017-03-18','FBW2_APP',175156]
['2017-03-16','NE77_APP',174570]
['2017-03-17','BFD1_APP',164238]
['2017-03-15','BFD1_APP',162931]
['2017-03-20','AF77_APP',152186]
['2017-03-17','PC77_APP',148727]
['2017-03-18','MI77_APP',147460]
['2017-03-16','BFD1_APP',145815]
['2017-03-20','BFD1_APP',145449]
['2017-03-15','PC77_APP',144959]
['2017-03-20','SSL1_APP',141719]]
The first pic is the plot I have created. The second one is the result I want, plotted with Excel:
Note: this is a plain-Python answer about sorting your input.
One way of doing this would be to use a two-dimensional list (a list of lists) and then sort it.
This is how you've been using it:
data = [date0,username0,randint0,date1,username1, ....
Try a two-dimensional list instead:
data = [[date0,username0,randint0], [date1,username1,randint1]...
Use the .sort() method and change the syntax to look like this:
data.sort()  # sort it; by default the list is sorted in increasing order (pass reverse=True for decreasing)
rows = zip(data[0][0], data[0][1], data[0][2])
The standard .sort() method has its limitations (floats, for one), so if it doesn't return the desired output, try .sort()'s parameters; here is an insight on the subject: How to use .sort()
If you are having trouble with floats, check an answer that will help you here.
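If you'd rather stay with the pandas pivot approach, reordering the pivot table's columns by their totals before plotting should put the largest contributors at the bottom of each stack, nearest the X-axis. A minimal sketch, assuming pivot_df is built as in the question:

# Order usernames by their total value, largest first, then plot in that column order
col_order = pivot_df.sum().sort_values(ascending=False).index
pivot_df[col_order].plot(kind='bar', stacked=True, figsize=(15, 7))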
My dataframe has a column that contains various type values, and I want to get the most frequent one:
In this case, I want to get the label FM-15, so that later on I can query only the data labeled with it.
How can I do that?
Now I can get away with:
most_count = df['type'].value_counts().max()
s = df['type'].value_counts()
s[s == most_count].index
This returns
Index([u'FM-15'], dtype='object')
But I feel this is too ugly, and I don't know how to use this Index() object to query df. I only know something like df = df[(df['type'] == 'FM-15')].
Use idxmax (called argmax in older pandas versions):
lbl = df['type'].value_counts().idxmax()
To query,
df.query("type==#lbl")
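For completeness, a small end-to-end sketch of the lookup and the filter, assuming a toy frame with a 'type' column:

import pandas as pd

df = pd.DataFrame({'type': ['FM-15', 'FM-15', 'FM-16', 'SAOD', 'FM-15']})

lbl = df['type'].value_counts().idxmax()   # 'FM-15'
subset = df[df['type'] == lbl]             # same result as df.query("type == @lbl")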
I am reading in a set of data using pandas and plotting it with matplotlib. One column is a "category", e.g. "Sports" or "Entertainment", but for some rows this is marked "Random", which means I need to distribute those rows by assigning each one to another category at random. Ideally I would like to do this in the dataframe so that all values get distributed.
My basic graph code is as follows:
df.category.value_counts().plot(kind="barh", alpha=a_bar)
title("Category Distribution")
The behaviour I would like is
If category == "Random"{
Assign this value to another column at random.
}
How can I accomplish this?
Possibly:
import numpy as np

# take the original value_counts and drop 'Random'
ts1 = df.category.value_counts()
rand_cnt = ts1['Random']
ts1.drop('Random', inplace=True)
# randomly choose from the other categories
ts2 = pd.Series(np.random.choice(ts1.index, rand_cnt)).value_counts()
# align the two series, and add them up
ts2 = ts2.reindex_like(ts1).fillna(0)
(ts1 + ts2).plot(kind='barh')
If you want to modify the original dataframe, then:
idx = df.category == 'Random'
xs = df.category[~idx].unique()  # all other categories
# randomly assign one of the other categories to the rows marked 'Random'
df.loc[idx, 'category'] = np.random.choice(xs, idx.sum())
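A quick way to sanity-check the reassignment, using a made-up frame (the category values here are assumptions, not the question's real data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['Sports', 'Entertainment', 'Random',
                                'Sports', 'Random', 'News']})

idx = df.category == 'Random'
xs = df.category[~idx].unique()
df.loc[idx, 'category'] = np.random.choice(xs, idx.sum())

# No 'Random' rows remain; their counts are now spread across the other categories
print(df.category.value_counts())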