Counting the number of customers by values in a second series - python

I have imported a list of customers into Python to run some RFM analysis. The analysis adds a new field to the data for the RFM class, so my data now looks like this:
customer RFMClass
0 0001914f-4655-4148-a1dc-1f25ca6d1f15 343
1 0002e50a-5551-4d9a-8734-76307dfe2131 341
2 00039977-512e-47ad-b929-170f18a1b14a 442
3 000693ff-2c61-425c-97c1-0286c874dd2f 443
4 00095dc2-7f37-48b0-894f-910d90cbbee2 142
5 000b748b-7ea0-48f2-a875-5f6cb95561d9 141
...
I'd like to plot a histogram showing the number of customers in each RFM class. How can I get a count of the number of distinct customer IDs per class?
I tried adding a 1 to every row with summary['number'] = 1, thinking it might be easier to count these rather than the customer IDs (which have already been de-duplicated in my code), but I can't figure out how to sum these per RFM class either.
Any thoughts on how I could do this?

I worked this out by using .groupby and summing a 'number' column assigned to each row; the same pattern is shown here with an analogous orders-per-hour dataset:
byhour = df.groupby(['Hour']).agg({'Orders': 'sum'})
print(byhour)
This then produces the desired output:
Orders
Hour
0 902
1 438
2 307
3 162
4 149
5 233
6 721
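For the original question, the same pattern applies directly to the RFM data. A minimal sketch (the toy DataFrame below stands in for the real customer data, with customer IDs shortened for readability):

import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the customer/RFMClass frame shown above
df = pd.DataFrame({
    'customer': ['0001914f', '0002e50a', '00039977', '000693ff', '00095dc2', '000b748b'],
    'RFMClass': [343, 341, 442, 443, 142, 141],
})

# Count distinct customers per RFM class
class_counts = df.groupby('RFMClass')['customer'].nunique()

# Plot one bar per class
class_counts.plot(kind='bar')
plt.xlabel('RFM class')
plt.ylabel('Number of customers')
plt.show()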


Python: How to find the number of items in each point on scatterplot and produce list?

Right now I have a dataset of 1206 participants, each of whom has endorsed a certain number of traumatic experiences and a number of symptoms associated with the trauma.
This is part of my dataframe (full dataframe is 1206 rows long):
SubjectID  PTSD_Symptom_Sum  PTSD_Trauma_Sum
1223       3                 5
1224       4                 2
1225       2                 6
1226       0                 3
I have two issues that I am trying to figure out:
I was able to create a scatter plot, but I can't tell from this plot how many participants are in each data point. Is there any easy way to see the number of subjects in each data point?
I used this code to create the scatterplot:
import matplotlib.pyplot as plt

plt.scatter(PTSD['PTSD_Symptom_SUM'], PTSD['PTSD_Trauma_SUM'])
plt.title('Trauma Sum vs. Symptoms')
plt.xlabel('Symptoms')
plt.ylabel('Trauma Sum')
I haven't been able to successfully produce a list of the number of people endorsing each pair of items (symptoms and trauma number). I am able to run this code to create the counts for the number of people in each category:
count_sum= PTSD['PTSD_SUM'].value_counts()
count_symptom_sum= PTSD['PTSD_symptom_SUM'].value_counts()
print(count_sum)
print(count_symptom_sum)
Which produces this output:
0 379
1 371
2 248
3 130
4 47
5 17
6 11
8 2
7 1
Name: PTSD_SUM, dtype: int64
0 437
1 418
2 247
3 74
4 23
5 4
6 3
Name: PTSD_symptom_SUM, dtype: int64
Is it possible to alter the code to count the number of people endorsing each pair of items (symptom number and trauma number)? If not, are there any functions that would allow me to do this?
You could create a new dataset with the counts of each pair 'PTSD_SUM', 'PTSD_Symptom_SUM' with:
counts = PTSD.groupby(by=['PTSD_symptom_SUM', 'PTSD_SUM']).size().to_frame('size').reset_index()
and then use Seaborn like this:
import seaborn as sns
sns.scatterplot(data=counts, x="PTSD_symptom_SUM", y="PTSD_SUM", hue="size", size="size")
This yields a scatter plot in which the hue and size of each point encode the number of participants with that pair of values.
If I understood properly, your dataframe is:
SubjectID TraumaSum Symptoms
1 1 5
2 3 4
...
So you just need:
dataset.groupby(by=['PTSD_SUM', 'PTSD_Symptom_SUM']).count()
This line will return the count for each unique pair of values.
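As another way to see the counts per pair, pd.crosstab produces a table with one cell per (symptom, trauma) combination; the toy frame below is a stand-in for the real PTSD data:

import pandas as pd

# Toy stand-in for the PTSD frame shown above
PTSD = pd.DataFrame({
    'PTSD_symptom_SUM': [3, 4, 2, 0, 3],
    'PTSD_SUM': [5, 2, 6, 3, 5],
})

# Rows are symptom sums, columns are trauma sums,
# and each cell is the number of participants with that pair
pair_counts = pd.crosstab(PTSD['PTSD_symptom_SUM'], PTSD['PTSD_SUM'])
print(pair_counts)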

I need filtered values with index name from data

I am getting the values I filtered with the script below, but the result does not show the column name. Can you help me get the result with the column name?
chas = df.CHAS[df.CHAS>=1]
chas
For the above script I am getting a result like below:
142 1
152 1
154 1
155 1
I need the result like below:
CHAS
142 1
152 1
154 1
155 1
IIUC, you want to obtain a DataFrame instead of a Series. Simple: just ask for a list of one column:
df[['CHAS']][df.CHAS >=1]
or even better:
df.loc[df['CHAS'] >= 1, ['CHAS']]
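A minimal sketch of the difference, using a toy frame in place of the real data:

import pandas as pd

df = pd.DataFrame({'CHAS': [0, 1, 1, 0, 1]})

# A single column label returns a Series: values only, no column header
print(df.loc[df['CHAS'] >= 1, 'CHAS'])

# A list of labels returns a DataFrame, so the CHAS header is shown
print(df.loc[df['CHAS'] >= 1, ['CHAS']])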

Extract numbers from strings from a column in pandas dataframe

I have a dataframe called data. I am trying to clean one of its columns so I can convert the prices into numerical values only.
This is how I'm filtering for the column to find those incorrect values.
data[data['incorrect_price'].astype(str).str.contains('[A-Za-z]')]
Incorrect_Price Occurences errors
23 99 cents 732 1
50 3 dollars and 49 cents 211 1
72 the price is 625 128 3
86 new price is 4.39 19 2
138 4 bucks 3 1
199 new price 429 13 1
225 price is 9.99 5 1
240 new price is 499 8 2
I have tried data['incorrect_Price'][20:51].str.findall(r"(\d+) dollars") and data['incorrect_Price'][20:51].str.findall(r"(\d+) cents") to find rows containing "cents" and "dollars" so I can extract the dollar and cent amounts, but I haven't been able to incorporate this when iterating over all rows of the dataframe.
I would like the results to look like this:
Incorrect_Price Desired Occurences errors
23 99 cents .99 732 1
50 3 dollars and 49 cents 3.49 211 1
72 the price is 625 625 128 3
86 new price is 4.39 4.39 19 2
138 4 bucks 4.00 3 1
199 new price 429 429 13 1
225 price is 9.99 9.99 5 1
240 new price is 499 499 8 2
The task can be solved relatively easily as long as the strings in Incorrect_Price retain the structure shown in the examples (numbers are not written out as words).
Using regular expressions you can extract the number part and an optional "cent[s]" or "dollar[s]" with an approach from a similar SO question. The two main differences are that you are looking for pairs of a numerical value and "cent[s]" or "dollar[s]", and that such pairs can potentially occur more than once.
import re

def extract_number_currency(value):
    # Find every (number, 'cent'|'dollar') pair in the string
    prices = re.findall(r'(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?', value)
    result = 0.0
    for value, currency in prices:
        partial = float(value)
        if currency == 'cent':
            result += partial / 100  # cents contribute hundredths
        else:
            result += partial
    return result
print(extract_number_currency('3 dollars and 49 cent'))
3.49
Now, what you need is to apply this function to all incorrect values in the column with prices in words. For simplicity I am applying it here to all values (but I am sure you will be able to deal with the subset):
data['Desired'] = data['Incorrect_Price'].apply(extract_number_currency)
Voila!
Breaking down the regex '(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?':
There are two named capture groups, (?P<name_of_the_capture_group> ... )
The first capture group (?P<value>[\d]*[.]?[\d]{1,2}) captures:
[\d] - digits
[\d]* - repeated 0 or more times
[.]? - followed by optional (?) dot
[\d]{1,2} - followed by a digit repeated from 1 to 2 times
\s* - denotes 0 or more whitespaces
Now the 2nd capture group which is much simpler: (?P<currency>cent|dollar)
cent|dollar - it boils down to alternative between cent and dollar strings being captured
s? matches an optional plural 's', as in 'cents' or 'dollars'
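Note that the function above only handles prices that mention "cent" or "dollar" explicitly; rows like "the price is 625" or "4 bucks" would come out as 0.0, while the desired output keeps those numbers. One possible fallback (an assumption on my part, not part of the original answer) is to grab the first bare number when no currency word matches:

import re

def extract_price(value):
    # Reuse extract_number_currency from above for dollar/cent phrases
    total = extract_number_currency(value)
    if total:
        return total
    # Fallback: take the first bare number in the string, so
    # 'the price is 625' -> 625.0 and '4 bucks' -> 4.0
    match = re.search(r'\d+(?:\.\d{1,2})?', value)
    return float(match.group()) if match else 0.0

print(extract_price('the price is 625'))        # 625.0
print(extract_price('4 bucks'))                 # 4.0
print(extract_price('3 dollars and 49 cents'))  # 3.49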

Group data by two columns and count it using pandas

I have the following two datasets:
songs
play_event
In songs the data is as below:
song_id total_plays
1 2000
2 4532
3 9999
4 2343
And in play_event the data is as below:
user_id song_id
102 1
103 4
102 1
102 3
104 2
102 1
Each time a song is played there is a new entry, even if the same song has been played before.
With this data I want to:
Get the total number of times each user played each song. For example, user_id 102 played song_id 1 three times in the data above. I want it grouped by user_id with a total count, something like below:
user_id song_id count
102 1 3
102 3 1
103 4 1
104 2 1
I am thinking of using pandas for this, but I want to know if pandas is the right choice.
If it's not pandas, what should be my way forward?
If pandas is the right choice, then:
The code below lets me get a count grouped by user_id or by song_id, but how do I get the count grouped by both user_id and song_id? See the sample code I tried below:
import pandas as pd

# Load data from csv file
data = pd.read_csv('play_events.csv')

# Gives how many entries per user
data['user_id'].value_counts()

# Gives how many entries per song
data['song_id'].value_counts()
For your first problem, a simple groupby and value_counts does the trick. Note that everything after value_counts() in the code below is just to get it to an actual dataframe in the same format as your desired output.
counts = play_events.groupby('user_id')['song_id'].value_counts().to_frame('count').reset_index()
>>> counts
user_id song_id count
0 102 1 3
1 102 3 1
2 103 4 1
3 104 2 1
Then for your second problem (which you have deleted in your edited post, but I will leave just in case it is useful to you), you can loop through counts, grouping by user_id, and save each as csv:
for user, data in counts.groupby('user_id', as_index=False):
    data.to_csv(str(user) + '_events.csv')
For your example dataframes, this gives you 3 csvs: 102_events.csv, 103_events.csv, and 104_events.csv. The first looks like:
user_id song_id count
0 102 1 3
1 102 3 1
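An equivalent way to get the same counts is to group by both columns at once and use size(); a small sketch, with a toy frame standing in for play_events:

import pandas as pd

# Toy stand-in for the play_events data above
play_events = pd.DataFrame({
    'user_id': [102, 103, 102, 102, 104, 102],
    'song_id': [1, 4, 1, 3, 2, 1],
})

# size() counts the rows in each (user_id, song_id) group
counts = (play_events.groupby(['user_id', 'song_id'])
          .size()
          .reset_index(name='count'))
print(counts)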

Simplify query to just 'where' by making new column with pandas?

I have a column of SQL queries that are fed to a function called Select_analysis.
Form:
Select_analysis(input_shapefile, output_name, {where_clause}) # it only accepts up to a where clause
Example:
SELECT * from OT # OT is a dataset
GROUP BY OT.CA # CA is a number that may occur many times, therefore we group by that field
HAVING ((Count(OT.OBJECTID))>1) # an id that appears more than once.
OT dataset
objectid CA
1 125
2 342
3 263
1 125
We group by CA.
About HAVING: it keeps only the groups whose objectid appears more than once, which is objectid 1 in this example.
My idea is to make another column that stores a result which can then be accessed with a simple where clause in the Select_analysis function.
example: OT dataset
objectid CA count_of_objectid_aftergroupby
1 125 2
2 342 1
3 263 1
1 125 2
So the call could then be:
Select_analysis(roads.shp,output.shp, count_of_objectid_aftergroupby > '1')
Notes
It has to be done in such a way that the Select_analysis function is still used at the end.
Assuming that you are pulling the data into pandas since it's tagged pandas, here's one possible solution:
df=pd.DataFrame({'objectID':[1,2,3,1],'CA':[125,342,463,125]}).set_index('objectID')
objectID CA
1 125
2 342
3 463
1 125
df['count_of_objectid_aftergroupby']=[df['CA'].value_counts().loc[x] for x in df['CA']]
objectID CA count_of_objectid_aftergroupby
1 125 2
2 342 1
3 463 1
1 125 2
The list comprehension does basically this:
Pull the value counts for each item in df['CA'] as a series.
Use .loc to index into the series at each value of 'CA' to find the count of that value.
Put that count into a list.
Assign that list as a new column.
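A vectorized equivalent of the list comprehension, as one possible alternative, uses groupby with transform('count') to broadcast each group's row count back onto its rows:

import pandas as pd

df = pd.DataFrame({'objectID': [1, 2, 3, 1],
                   'CA': [125, 342, 463, 125]}).set_index('objectID')

# transform('count') returns one value per original row,
# equal to the size of that row's CA group
df['count_of_objectid_aftergroupby'] = df.groupby('CA')['CA'].transform('count')
print(df)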
