This question already has answers here:
How to select rows in a DataFrame between two values, in Python Pandas?
(7 answers)
Closed 1 year ago.
Example df
How do I create a histogram that in this case only uses the range of 2–5 points, instead of the entire points data range of 1–6?
I'm trying to display only the average spread of the data, not the extreme areas. Is there maybe a function to zoom in on the significant ranges? And is there a smart way to describe those?
For your specific data, you can first filter your DataFrame, then call .hist(). Note that Series.between(left, right) includes both the left and right values:
df[df['points'].between(2, 5)].hist()
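A minimal, self-contained sketch of that approach (the points values here are made up to cover the 1–6 range from the question):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless

# Hypothetical data spanning the full 1-6 range
df = pd.DataFrame({"points": [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]})

# between() is inclusive on both ends, so 2 and 5 are both kept
subset = df[df["points"].between(2, 5)]
subset.hist()
```

If you only want to zoom the axis rather than drop the data, `df.hist(range=(2, 5))` forwards the `range` argument to matplotlib instead of filtering rows.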
This question already has answers here:
seaborn heatmap color scheme based on row values
(1 answer)
Coloring Cells in Pandas
(4 answers)
Closed 5 months ago.
I have a table (currently produced in Excel) where a row-wise comparison is made and each cell in a row is ranked from lowest to highest. The best score is strong green, the second best is less green, and the worst score is red. Cells with an equal score get a similar color based on their shared rank.
For some rows the ranking is based on an ascending score, while other rows rank in descending order.
How can this be done using Python? Do you know any modules capable of doing something similar? I've used Seaborn for other heatmaps, but none of them were based on a row-wise comparison.
Any ideas?
The colors are not important. I just want to know how to rank the cells of each row compared to each other.
Use background_gradient. The RdYlGn colormap sounds like it matches your description, though it won't be a 100% reproduction of Excel's color scale.
df.style.background_gradient("RdYlGn", axis=1)
List of color maps: https://matplotlib.org/stable/tutorials/colors/colormaps.html
This question already has answers here:
Pandas how to use pd.cut()
(5 answers)
Closed 6 months ago.
I am using Pandas cut to bin certain values in ranges according to a column. I am using user-defined bins, i.e. the ranges are being passed as an array.
df['Range'] = pd.cut(df.TOTAL, bins=[0,100,200,300,400,450,500,600,700,800,900,1000,2000])
However, the values I have range up to 100000. These bins cap the upper limit at 2000, so I am losing values greater than 2000. I want to keep an interval for everything greater than 2000. Is there any way to do this?
Add np.inf to the end of your bin list (this requires import numpy as np):
pd.cut(df.TOTAL, bins=[0,100,200,300,400,450,500,600,700,800,900,1000,2000,np.inf])
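For example, with some hypothetical totals, every value lands in a bin and nothing above 2000 is lost:

```python
import numpy as np
import pandas as pd

# Hypothetical totals, including values above 2000
totals = pd.Series([50, 450, 1500, 2500, 100000], name="TOTAL")

bins = [0, 100, 200, 300, 400, 450, 500, 600, 700, 800, 900, 1000, 2000, np.inf]
ranges = pd.cut(totals, bins=bins)
```

Values above 2000 now fall into the open-ended (2000, inf] interval instead of becoming NaN.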
This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 2 years ago.
I have used the following code to make a point plot.
data_agg = data.groupby('HourOfDay')['travel_time'].aggregate(np.median).reset_index()
plt.figure(figsize=(12,3))
sns.pointplot(x=data.HourOfDay.values, y=data.travel_time.values)
plt.show()
However I want to choose hours above 8 only and not 0-7. How do I proceed with that?
What about filtering first?
data_filtered = data[data['HourOfDay'] > 7]
# adjust the comparison depending on the dtype of the HourOfDay column
data_agg = data_filtered.groupby('HourOfDay')['travel_time'].aggregate(np.median).reset_index()
plt.figure(figsize=(12, 3))
sns.pointplot(x=data_filtered.HourOfDay.values, y=data_filtered.travel_time.values)
plt.show()
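A self-contained version of that approach with made-up data (this assumes seaborn is installed and uses a non-interactive backend so it runs headless):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: 200 trips spread over hours 0-23
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "HourOfDay": rng.integers(0, 24, 200),
    "travel_time": rng.uniform(5, 60, 200),
})

# Keep hours 8 and above only
data_filtered = data[data["HourOfDay"] > 7]

# Median travel time per remaining hour
data_agg = (
    data_filtered.groupby("HourOfDay")["travel_time"]
    .median()
    .reset_index()
)

plt.figure(figsize=(12, 3))
sns.pointplot(x="HourOfDay", y="travel_time", data=data_agg)
```

Recent seaborn versions require x and y as keyword arguments, which is why they are spelled out here.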
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
So here's my daily challenge:
I have an Excel file containing a list of streets, and some of those streets will be doubled (or tripled) based on their road type. For instance:
In another Excel file, I have the street names (without duplicates) and their mean distances between features such as this:
Both Excel files have been converted to pandas dataframes as so :
duplicates_df = pd.DataFrame()
duplicates_df['Street_names'] = street_names
dist_df = pd.DataFrame()
dist_df['Street_names'] = names_dist_values
dist_df['Mean_Dist'] = dist_values
dist_df['STD'] = std_values
I would like to find a way to append the mean distance and STD values multiple times in duplicates_df whenever a street has more than one occurrence, but I am struggling with the proper syntax. This is probably an easy fix, but I've never done this before.
The desired output would be:
Any help would be greatly appreciated!
Thanks again!
pd.merge(duplicates_df, dist_df, on="Street_names")
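A small sketch with made-up street names showing how the merge repeats each street's stats once per duplicate row:

```python
import pandas as pd

# Streets repeated once per road type (hypothetical names)
duplicates_df = pd.DataFrame({"Street_names": ["Main", "Main", "Oak", "Pine"]})

# One row per street with its stats
dist_df = pd.DataFrame({
    "Street_names": ["Main", "Oak", "Pine"],
    "Mean_Dist": [12.5, 8.0, 20.1],
    "STD": [1.2, 0.5, 3.3],
})

# The (default) inner merge attaches Mean_Dist and STD to every
# occurrence of each street in duplicates_df
merged = pd.merge(duplicates_df, dist_df, on="Street_names")
```

Because "Main" appears twice in duplicates_df, its Mean_Dist and STD appear twice in the result, which is exactly the repetition the question asks for.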
This question already has answers here:
Sorted bar charts with pandas/matplotlib or seaborn
(2 answers)
Closed 5 years ago.
I have a dataframe that looks like this:
docs_df2.sample(10)
And the dtypes are:
filetype object
hash object
num_users int64
num_tags int64
dtype: object
Now I want to see what the distribution of num_tags look like, so I plot the count() of the dataframe grouped by num_tags (hash is unique):
So far so good, but now I want a histogram so I can clearly see the power-law nature of my data. I get results, but I think they're plotted the wrong way around:
docs_df.groupby('num_tags')['hash'].count().plot(kind='hist')
This is not what I want, though.
What I would like is to have the different "types" of num_tags (all 31 of them) in the x-axis ordered by their frequency, and the actual frequency in the y-axis.
Something like this:
What you are trying to make is not actually a histogram. A histogram would be used to display the frequency (count) of some measure while fitting a range of measures into a specified number of bins. You have already counted the data. What you want is a sorted bar graph. As an example, since I can't use your data...
df = pd.DataFrame(np.random.randint(1, 10, (1000, 2)), columns=['num_users', 'num_tags'])
df.groupby('num_tags').count()['num_users'].plot(kind='bar')
Now we just need to sort the bars
df.groupby('num_tags').count()['num_users'].sort_values(ascending=False).plot(kind='bar')
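Putting it together as a runnable sketch (same random data as above, with a seed and a non-interactive backend so it runs headless):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")

np.random.seed(0)
df = pd.DataFrame(
    np.random.randint(1, 10, (1000, 2)),
    columns=["num_users", "num_tags"],
)

# Count rows per num_tags value, then sort so the tallest bar comes first
counts = df.groupby("num_tags").count()["num_users"].sort_values(ascending=False)
counts.plot(kind="bar")
```

As a shortcut, `df["num_tags"].value_counts()` produces the same counts already sorted in descending order, so it can replace the groupby/sort chain here.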