Let's say I have a categorical variable with the following values, given by calling unique() on the column in the dataframe:
Categories (7, object): [0-2, 6-8, 9-11, 3-5, 15-17, 12-14, 24-26]
and that I have the following occurrences for each of these categories given by calling value_counts():
0-2 209
3-5 34
6-8 17
9-11 7
15-17 6
12-14 3
24-26 1
what would be a good way to coarsen/compress these categories into two new categories "high" and "low"?
You can do this with pd.cut applied to the right (upper) end of each range, cutting into two bins; you could also use pd.qcut to get a different, quantile-based split.
groupkey=pd.cut(s.index.str.split('-').str[-1].astype(int),2,labels=['low','high'])
s.groupby(groupkey).sum()
low 270
high 7
Name: v, dtype: int64
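For completeness, here is a self-contained sketch of the same idea; the Series s (and its name 'v') is reconstructed here from the value_counts() output in the question, so those names are assumptions:

import pandas as pd

# Rebuild the value_counts() result as a Series indexed by the category labels.
s = pd.Series(
    [209, 34, 17, 7, 6, 3, 1],
    index=['0-2', '3-5', '6-8', '9-11', '15-17', '12-14', '24-26'],
    name='v',
)

# Bin each category by the upper end of its range into two equal-width bins.
upper = s.index.str.split('-').str[-1].astype(int)
groupkey = pd.cut(upper, 2, labels=['low', 'high'])
print(s.groupby(groupkey).sum())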
Pandas question here.
I have a dataset in which subjective ratings are sampled several times per second. The data is laid out as below. What I need is a way to count the number of blank cells before every marked second (i.e. each "1" in the seconds column, which occurs at regular intervals), so I can feed that value into a greatest-common-factor calculation and build a rough linear extrapolation in milliseconds. In the example below that number would be 2, and I would feed it into the GCF formula. The end goal is a more accurate/usable timestamp. Sampling rates may vary by dataset.
index  rating  seconds
1      26
2      28
3      30      1
4      33
5      40
6      45      1
7      50
8      48
9      49      1
If you just want to count the number of NaNs before the first 1:
df['seconds'].isna().cummin().sum()
If the blanks are some other value (e.g. an empty string):
df['seconds'].eq('').cummin().sum()
Output: 2
Or, if you have a RangeIndex:
df['seconds'].first_valid_index()
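If you also need the gap before every marked second (not just the first) so you can feed the gaps into a GCF calculation, a sketch along these lines may help; it rebuilds the example table and assumes the marker is the value 1 and the blanks are NaN:

import numpy as np
import pandas as pd
from math import gcd
from functools import reduce

df = pd.DataFrame({
    'rating':  [26, 28, 30, 33, 40, 45, 50, 48, 49],
    'seconds': [np.nan, np.nan, 1, np.nan, np.nan, 1, np.nan, np.nan, 1],
})

# 0-based positions of the rows where a second is marked.
marks = np.flatnonzero(df['seconds'].eq(1))

# Blank rows before the first mark, then between consecutive marks.
gaps = np.diff(np.r_[-1, marks]) - 1      # array([2, 2, 2])

sampling = reduce(gcd, gaps)              # greatest common factor -> 2
print(gaps, sampling)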
I'm new to pandas and any help would be much appreciated. I'm currently analyzing some Airbnb data and have over 50 different columns. Some of these columns have tens of thousands of unique values while some have very few unique values (categorical).
How do I loop over the columns that have less than 10 unique values to generate plots for them?
Count of unique values in each column:
id 38185
last_scraped 3
name 36774
description 34061
neighborhood_overview 18479
picture_url 37010
host_since 4316
host_location 1740
host_about 14178
host_response_time 4
host_response_rate 78
host_acceptance_rate 101
host_is_superhost 2
host_neighbourhood 486
host_total_listings_count 92
host_verifications 525
host_has_profile_pic 2
host_identity_verified 2
neighbourhood_cleansed 222
neighbourhood_group_cleansed 5
property_type 80
room_type 4
The counts above are stored in unique_vals = df.nunique()
Apologies if this is a repeat question; the closest answer I could find was "Iterate through columns to generate separate plots in python", but it pertained to the entire dataset.
Thanks!
You can filter the columns using df.columns[unique_vals < 10]
You can also pass the df.nunique() call directly if you wish:
unique_columns = df.columns[df.nunique() < 10]
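From there, a minimal plotting loop might look like this; it assumes matplotlib and that a bar chart of value counts is a reasonable plot for a low-cardinality column:

import matplotlib.pyplot as plt

low_card_cols = df.columns[df.nunique() < 10]

for col in low_card_cols:
    df[col].value_counts().plot(kind='bar', title=col)  # one bar chart per column
    plt.tight_layout()
    plt.show()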
I am working with a large dataset which I've stored in a pandas dataframe. All of my methods I've written to operate on this dataset work on dataframes, but some of them don't work on GroupBy objects.
I've come to a point in my code where I would like to group all data by author name (which I was able to achieve easily via .groupby()). Unfortunately, this outputs a GroupBy object which isn't very useful to me when I want to use dataframe only methods.
I've searched tons of other posts but haven't found a satisfying answer... how do I convert this GroupBy object back into a DataFrame? (Note: it is much too large for me to manually select groups and concatenate them into a dataframe; I need something automated.)
Not exactly sure I understand, so if this isn't what you are looking for, please comment.
Creating a dataframe:
df = pd.DataFrame({'author':['gatsby', 'king', 'michener', 'michener','king','king', 'tolkein', 'gatsby'], 'b':range(13,21)})
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
#create the groupby object
dfg = df.groupby('author')
In [44]: dfg
Out[44]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002169D24DB20>
#show groupby works using count()
dfg.count()
b
author
gatsby 2
king 3
michener 2
tolkein 1
But I think this is what you want: how to revert dfg back to a dataframe. You just need to apply some function to it that doesn't change the data. This is one way:
df_reverted = dfg.apply(lambda x: x)
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
This is another way and may be faster; note the dataframe names df and dfg.
df[dfg['b'].transform('count') > 0]
This counts the rows in each group and keeps every group with a count greater than zero (so everything); transform returns a series aligned with the original index, and that boolean mask is then used to index the original dataframe, df.
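As a side note, if what you ultimately need is a plain DataFrame of per-group results rather than the original rows, any aggregation followed by reset_index() also gets you out of the GroupBy object (a small sketch reusing the example above):

# Aggregate per author, then turn the group labels back into a regular column.
per_author = dfg['b'].sum().reset_index()
print(type(per_author))   # <class 'pandas.core.frame.DataFrame'>
print(per_author)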
Let's say I have a table with two columns: Date and Amount. Number of rows are not more than 3000.
Row Date Amount
1 15/05/2021 248
2 16/05/2021 115
3 17/05/2021 387
4 18/05/2021 214
5 19/05/2021 678
6 20/05/2021 489
7 21/05/2021 875
8 22/05/2021 123
................
I need to add a third column which will calculate the trim mean values based on the Amount column.
I will be using this function: my_table['TrimMean'] = stats.trim_mean(my_table['Amount'], 0.1), but adapted for my problem.
The problem is that the range is not fixed but dynamic: for each row in my table, the trimmed mean is calculated from the previous 90 values of the Amount column, ending at the row directly above the current one. If fewer than 90 values are available, it should be calculated from however many rows there are.
e.g. TrimMean[1000] = stats.trim_mean(values of Amount from rows 910 to 999), and TrimMean[12] = stats.trim_mean(values of Amount from rows 1 to 11)
Hope that makes sense.
Is there any way I can calculate this in a simple way, without going through row by row iteration?
We can calculate the trimmed mean by applying trim_mean over a rolling window of size 90 with min_periods=1, then shifting the result down one row so the current row is excluded:
from scipy.stats import trim_mean
df['Amount'].rolling(90, min_periods=1).apply(trim_mean, args=(0.1, )).shift()
0 NaN
1 248.000000
2 181.500000
3 250.000000
4 241.000000
5 328.400000
6 355.166667
7 429.428571
Name: Amount, dtype: float64
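Put together as a self-contained sketch (assuming the table lives in a DataFrame df with an 'Amount' column, as in the question):

import pandas as pd
from scipy.stats import trim_mean

df = pd.DataFrame({'Amount': [248, 115, 387, 214, 678, 489, 875, 123]})

df['TrimMean'] = (
    df['Amount']
    .rolling(90, min_periods=1)        # window of up to the previous 90 rows
    .apply(trim_mean, args=(0.1,))     # 10% trimmed mean of each window
    .shift()                           # shift down so the current row is excluded
)
print(df)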
Right now I have a dataset of 1206 participants who have each endorsed a certain number of traumatic experiences and a number of symptoms associated with the trauma.
This is part of my dataframe (full dataframe is 1206 rows long):
SubjectID  PTSD_Symptom_Sum  PTSD_Trauma_Sum
1223       3                 5
1224       4                 2
1225       2                 6
1226       0                 3
I have two issues that I am trying to figure out:
I was able to create a scatter plot, but I can't tell from this plot how many participants are in each data point. Is there any easy way to see the number of subjects in each data point?
I used this code to create the scatterplot:
import matplotlib.pyplot as plt

plt.scatter(PTSD['PTSD_Symptom_SUM'], PTSD['PTSD_Trauma_SUM'])
plt.title('Trauma Sum vs. Symptoms')
plt.xlabel('Symptoms')
plt.ylabel('Trauma Sum')
plt.show()
I haven't been able to successfully produce a list of the number of people endorsing each pair of items (symptoms and trauma number). I am able to run this code to create the counts for the number of people in each category:
count_sum= PTSD['PTSD_SUM'].value_counts()
count_symptom_sum= PTSD['PTSD_symptom_SUM'].value_counts()
print(count_sum)
print(count_symptom_sum)
Which produces this output:
0 379
1 371
2 248
3 130
4 47
5 17
6 11
8 2
7 1
Name: PTSD_SUM, dtype: int64
0 437
1 418
2 247
3 74
4 23
5 4
6 3
Name: PTSD_symptom_SUM, dtype: int64
Is it possible to alter the code to count the number of people endorsing each pair of items (symptom number and trauma number)? If not, are there any functions that would allow me to do this?
You could create a new dataset with the counts of each pair 'PTSD_SUM', 'PTSD_Symptom_SUM' with:
counts = PTSD.groupby(by=['PTSD_symptom_SUM', 'PTSD_SUM']).size().to_frame('size').reset_index()
and then use Seaborn like this:
import seaborn as sns
sns.scatterplot(data=counts, x="PTSD_symptom_SUM", y="PTSD_SUM", hue="size", size="size")
This produces a scatter plot in which both the hue and the marker size encode the number of participants at each (symptom, trauma) pair.
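If you would rather stay with plain matplotlib, a similar effect is possible by passing the counts as the marker size (a sketch reusing the counts frame built above; the scaling factor of 10 is arbitrary):

import matplotlib.pyplot as plt

plt.scatter(counts['PTSD_symptom_SUM'], counts['PTSD_SUM'], s=counts['size'] * 10)
plt.xlabel('Symptoms')
plt.ylabel('Trauma Sum')
plt.show()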
If I understood properly, your dataframe is:
SubjectID TraumaSum Symptoms
1 1 5
2 3 4
...
So you just need:
dataset.groupby(by=['PTSD_SUM', 'PTSD_Symptom_SUM']).count()
This line returns the count for each unique pair of values.
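As a hedged alternative, pd.crosstab gives the same pair counts laid out as a table, with one sum on each axis, which can be easier to scan (column names taken from the question's code):

import pandas as pd

# Rows: symptom sums, columns: trauma sums, cells: number of participants.
pair_counts = pd.crosstab(PTSD['PTSD_symptom_SUM'], PTSD['PTSD_SUM'])
print(pair_counts)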