Implementation of Plotly on pandas dataframe from pyspark transformation - python

I'd like to produce plotly plots from pandas dataframes, but I am struggling with this topic.
Now, I have this:
    AGE_GROUP  shop_id  count_of_member
0          10        1               40
1          10       12            57615
2          20        1              186
4          30        1              175
5          30       12           322458
6          40        1              171
7          40       12           313758
8          50        1              158
10         60        1              168
Some shops might not have a record. As an example, plotly will need x=[1,2,3], y=[4,5,6]. If my input is x=[1,2,3] and y=[4,5], then x and y are not the same size and an exception will be raised. I need to add a null-value record for the missing shop_id. So I need this:
    AGE_GROUP  shop_id  count_of_member
0          10        1               40
1          10       12            57615
2          20        1              186
3          20       12                0
4          30        1              175
5          30       12           322458
6          40        1              171
7          40       12           313758
8          50        1              158
9          50       12                0
10         60        1              168
11         60       12                0
For each AGE_GROUP I need to have 2 rows, one per shop_id, since the unique set of shop_id values is 1 and 12;
if there are 10 age groups, 20 rows will be shown.
For example:
   AGE_GROUP  shop_id  count_of_member
1         10       12            57615
2         20        1              186
3         30        1              175
4         40        1              171
5         40       12           313758
6         50        1              158
7         60        1              168
There are 2 unique shop_id values, 1 and 12, and 6 different age groups: 10, 20, 30, 40, 50, 60.
In age_group 10 only shop_id 12 exists, but not shop_id 1.
So I need to add a new record to show that the count_of_member of age_group 10 for shop_id 1 is 0.
The final dataframe I will get should be:
   AGE_GROUP  shop_id  count_of_member
1         10       12            57615
1         10        1                0    **
2         20        1              186
2         20       12                0    **
3         30        1              175
3         30       12                0    **
4         40        1              171
5         40       12           313758
6         50        1              158
6         50       12                0    **
7         60       12                0
7         60        1              168
Rows marked with ** are the newly added rows.
How can I implement this transformation?

How can I implement this transformation?
First of all, you don't have to.
When used correctly, plotly offers a wide array of approaches that let you visualize your dataset as it is, even when your data look like your third sample:
   AGE_GROUP  shop_id  count_of_member
1         10       12            57615
2         20        1              186
3         30        1              175
4         40        1              171
5         40       12           313758
6         50        1              158
7         60        1              168
There's no need to use pandas to reshape your data into the structure of the fourth sample. You're not too clear on what you'd like to do with this sample, but I suspect you'd like to show the accumulated count_of_member per age group, split by shop_id, like this?
You may wonder why the blue bars for shop_id=1 aren't showing. That's just because the numbers differ so hugely in size. If you replace the minuscule count_of_member values for shop_id=1 with something more comparable to those for shop_id=12, you'll get this instead:
Below is a complete code snippet where the altered dataset has been commented out. The dataset used is still the same as in your first data sample.
Complete code:
# imports
import plotly.graph_objects as go
import pandas as pd

data = {'AGE_GROUP': {0: 10, 1: 10, 2: 20, 4: 30, 5: 30, 6: 40, 7: 40, 8: 50, 10: 60},
        'shop_id': {0: 1, 1: 12, 2: 1, 4: 1, 5: 12, 6: 1, 7: 12, 8: 1, 10: 1},
        'count_of_member': {0: 40,
                            1: 57615,
                            2: 186,
                            4: 175,
                            5: 322458,
                            6: 171,
                            7: 313758,
                            8: 158,
                            10: 168}}

## Optional dataset with the counts for shop_id=1 scaled up for comparison
# data = {'AGE_GROUP': {0: 10, 1: 10, 2: 20, 4: 30, 5: 30, 6: 40, 7: 40, 8: 50, 10: 60},
#         'shop_id': {0: 1, 1: 12, 2: 1, 4: 1, 5: 12, 6: 1, 7: 12, 8: 1, 10: 1},
#         'count_of_member': {0: 40,
#                             1: 57615,
#                             2: 186000,
#                             4: 175000,
#                             5: 322458,
#                             6: 171000,
#                             7: 313758,
#                             8: 158000,
#                             10: 168000}}

# create DataFrame
df = pd.DataFrame(data)

# unique shop_ids
shops = df['shop_id'].unique()

# set up plotly figure
fig = go.Figure()

# add one trace per shop_id and show counts per AGE_GROUP
for shop in shops:
    # subset dataframe by shop_id
    df_ply = df[df['shop_id'] == shop]
    # add trace
    fig.add_trace(go.Bar(x=df_ply['AGE_GROUP'], y=df_ply['count_of_member'],
                         name='shop_id' + str(shop)))

fig.show()
EDIT:
If you for some reason still need to structure your data as in your fourth sample, I suggest that you raise another question and specifically tag it with [pandas] and [python] only, and exclusively focus on the data transformation part of the question.
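That said, a minimal sketch of one possible way to do that fill (my assumption of what you need, not part of the original answer) is to reindex against the full AGE_GROUP x shop_id grid and fill the missing combinations with 0:

import pandas as pd

# the third data sample from the question
df = pd.DataFrame({'AGE_GROUP':       [10, 20, 30, 40, 40, 50, 60],
                   'shop_id':         [12,  1,  1,  1, 12,  1,  1],
                   'count_of_member': [57615, 186, 175, 171, 313758, 158, 168]})

# every AGE_GROUP / shop_id combination that should exist
full_index = pd.MultiIndex.from_product(
    [sorted(df['AGE_GROUP'].unique()), sorted(df['shop_id'].unique())],
    names=['AGE_GROUP', 'shop_id'])

# reindex against the full grid; missing combinations get count_of_member = 0
df_full = (df.set_index(['AGE_GROUP', 'shop_id'])
             .reindex(full_index, fill_value=0)
             .reset_index())
print(df_full)

Using set_index plus reindex with fill_value keeps the zero rows next to the real ones, which is what the fourth sample shows.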

Related

Map counts of a numerical column from a new DataFrame to the bin range column of training data

I am trying to get the count of the Age column and append it to my existing bin-range column. I am able to do it for the training df and want to do it for the prediction data. How do I map the counts of the Age column from the prediction data to the Age_bin column in my training data? The first image is my output DF, whereas the 2nd one is the sample DF. I can get the count using value_counts() for the file I am reading.
First image - bin and count from training data
Second image - Training data
Third image - Prediction data
Fourth image - Final output
The Data
import pandas as pd

data = {
    0: 0,
    11: 1500,
    12: 1000,
    22: 3000,
    32: 35000,
    34: 40000,
    44: 55000,
    65: 7000,
    80: 8000,
    100: 1000000,
}
df = pd.DataFrame(data.items(), columns=['Age', 'Salary'])
Age Salary
0 0 0
1 11 1500
2 12 1000
3 22 3000
4 32 35000
5 34 40000
6 44 55000
7 65 7000
8 80 8000
9 100 1000000
The Code
bins = [-0.1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# create a "binned" column
df['binned'] = pd.cut(df['Age'], bins)
# add bin count
df['count'] = df.groupby('binned')['binned'].transform('count')
The Output
Age Salary binned count
0 0 0 (-0.1, 10.0] 1
1 11 1500 (10.0, 20.0] 2
2 12 1000 (10.0, 20.0] 2
3 22 3000 (20.0, 30.0] 1
4 32 35000 (30.0, 40.0] 2
5 34 40000 (30.0, 40.0] 2
6 44 55000 (40.0, 50.0] 1
7 65 7000 (60.0, 70.0] 1
8 80 8000 (70.0, 80.0] 1
9 100 1000000 (90.0, 100.0] 1
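To actually bring the counts of the prediction data's Age column over to the training bins, one possible sketch (assuming both frames share the same bin edges, and using hypothetical df_train / df_pred frames, since the real data is only shown in the images) is to bin both with pd.cut and map the prediction counts back:

import pandas as pd

bins = [-0.1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# hypothetical training and prediction data (the real frames come from the images)
df_train = pd.DataFrame({'Age': [0, 11, 12, 22, 32, 34, 44, 65, 80, 100]})
df_pred = pd.DataFrame({'Age': [5, 15, 18, 33, 47, 47, 90]})

# bin both frames with the same edges; cast to str so the bins are easy to map
df_train['Age_bin'] = pd.cut(df_train['Age'], bins).astype(str)
pred_counts = pd.cut(df_pred['Age'], bins).astype(str).value_counts()

# map the prediction counts onto the training bins; bins with no prediction rows get 0
df_train['pred_count'] = df_train['Age_bin'].map(pred_counts).fillna(0).astype(int)
print(df_train)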

Creating a dataframe based on 2 dataframe sets that have different lengths

I have 2 dataframes and I want to create a third one. I am trying to write code that does the following:
If A_pd["from"] and A_pd["To"] are within the range of B_pd["from"] and B_pd["To"], then add A_pd["from"], A_pd["To"] and B_pd["Value"] to the C_pd dataframe.
If A_pd["from"] is within the range of B_pd["from"] and B_pd["To"], but A_pd["To"] is within the range of B_pd["from"] and B_pd["To"] of the next row, then I want to split the range A_pd["from"] to A_pd["To"] into 2 ranges (A_pd["from"] to B_pd["To"]) and (B_pd["To"] to A_pd["To"]) with the corresponding B_pd["Value"].
I created the following code:
import pandas as pd

A_pd = {'from': [0, 20, 80, 180, 250],
        'To':   [20, 50, 120, 210, 300]}
A_pd = pd.DataFrame(A_pd)

B_pd = {'from':  [0, 20, 100, 200],
        'To':    [20, 100, 200, 300],
        'Value': [20, 17, 15, 12]}
B_pd = pd.DataFrame(B_pd)

for i in range(len(A_pd)):
    numberOfIntrupt = 0
    for j in range(len(B_pd)):
        if A_pd["from"].values[i] >= B_pd["from"].values[j] and A_pd["from"].values[i] > B_pd["To"].values[j]:
            numberOfIntrupt += 1

cols = ['C_from', 'C_To', 'C_value']
C_dp = pd.DataFrame(columns=cols, index=range(len(A_pd) + numberOfIntrupt))

for i in range(len(A_pd)):
    for j in range(len(B_pd)):
        a = A_pd["from"].values[i]
        b = A_pd["To"].values[i]
        c_eval = B_pd["Value"].values[j]
        range_s = B_pd["from"].values[j]
        range_f = B_pd["To"].values[j]
        if a >= range_s and a <= range_f and b >= range_s and b <= range_f:
            C_dp['C_from'].loc[i] = a
            C_dp['C_To'].loc[i] = b
            C_dp['C_value'].loc[i] = c_eval
        elif a >= range_s and b > range_f:
            C_dp['C_from'].loc[i] = a
            C_dp['C_To'].loc[i] = range_f
            C_dp['C_value'].loc[i] = c_eval
            C_dp['C_from'].loc[i+1] = range_f
            C_dp['C_To'].loc[i+1] = b
            C_dp['C_value'].loc[i+1] = B_pd["Value"].values[j+1]
print(C_dp)
The current result is C_dp:
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 180 200 15
4 250 300 12
5 200 300 12
6 NaN NaN NaN
7 NaN NaN NaN
The expected result should be:
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 100 120 15
4 180 200 15
5 200 210 12
6 250 300 12
Thank you a lot for the support
I'm sure there is a better way to do this without loops, but this will help your logic flow.
import pandas as pd

A_pd = {'from': [0, 20, 80, 180, 250],
        'To':   [20, 50, 120, 210, 300]}
A_pd = pd.DataFrame(A_pd)

B_pd = {'from':  [0, 20, 100, 200],
        'To':    [20, 100, 200, 300],
        'Value': [20, 17, 15, 12]}
B_pd = pd.DataFrame(B_pd)

cols = ['C_from', 'C_To', 'C_value']
C_dp = pd.DataFrame(columns=cols)

spillover = False
for i in range(len(A_pd)):
    for j in range(len(B_pd)):
        a_from = A_pd["from"].values[i]
        a_to = A_pd["To"].values[i]
        b_from = B_pd["from"].values[j]
        b_to = B_pd["To"].values[j]
        b_value = B_pd['Value'].values[j]
        if (a_from >= b_to):
            # a_from outside b range
            continue  # next b
        elif (a_from >= b_from):
            # a_from within b range
            if a_to <= b_to:
                C_dp = C_dp.append({"C_from": a_from, "C_To": a_to, "C_value": b_value}, ignore_index=True)
                break  # next a
            else:
                C_dp = C_dp.append({"C_from": a_from, "C_To": b_to, "C_value": b_value}, ignore_index=True)
                if j < len(B_pd):
                    spillover = True
                continue
        if spillover:
            if a_to <= b_to:
                C_dp = C_dp.append({"C_from": b_from, "C_To": a_to, "C_value": b_value}, ignore_index=True)
                spillover = False
                break
            else:
                C_dp = C_dp.append({"C_from": b_from, "C_To": b_to, "C_value": b_value}, ignore_index=True)
                spillover = True
                continue
print(C_dp)
Output
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 100 120 15
4 180 200 15
5 200 210 12
6 250 300 12
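As a rough sketch of a somewhat more general variant (not the asker's code; it still loops over A_pd, but it collects rows in a list instead of calling append repeatedly and it handles an A range that spans more than two B rows):

import numpy as np
import pandas as pd

A_pd = pd.DataFrame({'from': [0, 20, 80, 180, 250],
                     'To':   [20, 50, 120, 210, 300]})
B_pd = pd.DataFrame({'from':  [0, 20, 100, 200],
                     'To':    [20, 100, 200, 300],
                     'Value': [20, 17, 15, 12]})

# B's boundaries: its 'from' values plus the final 'To'
boundaries = np.append(B_pd['from'].values, B_pd['To'].values[-1])

rows = []
for a_from, a_to in zip(A_pd['from'], A_pd['To']):
    # cut the A range at every B boundary that falls strictly inside it
    cuts = boundaries[(boundaries > a_from) & (boundaries < a_to)]
    edges = np.concatenate(([a_from], cuts, [a_to]))
    for start, stop in zip(edges[:-1], edges[1:]):
        # the B row whose range contains the start of this piece supplies the value
        value = B_pd.loc[(B_pd['from'] <= start) & (start < B_pd['To']), 'Value'].iloc[0]
        rows.append({'C_from': start, 'C_To': stop, 'C_value': value})

C_dp = pd.DataFrame(rows)
print(C_dp)

For the sample data this produces the same seven rows as the expected output above.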

Pandas assign group numbers for each time bin

I have a pandas dataframe that looks like below.
Key Name Val1 Val2 Timestamp
101 A 10 1 01-10-2019 00:20:21
102 A 12 2 01-10-2019 00:20:21
103 B 10 1 01-10-2019 00:20:26
104 C 20 2 01-10-2019 14:40:45
105 B 21 3 02-10-2019 09:04:06
106 D 24 3 02-10-2019 09:04:12
107 A 24 3 02-10-2019 09:04:14
108 E 32 2 02-10-2019 09:04:20
109 A 10 1 02-10-2019 09:04:22
110 B 10 1 02-10-2019 10:40:49
Starting from the earliest timestamp, that is, '01-10-2019 00:20:21', I need to create time bins of 10 seconds each and assign the same group number to all rows whose timestamps fall in the same time bin.
The output should look as below.
Key Name Val1 Val2 Timestamp Group
101 A 10 1 01-10-2019 00:20:21 1
102 A 12 2 01-10-2019 00:20:21 1
103 B 10 1 01-10-2019 00:20:26 1
104 C 20 2 01-10-2019 14:40:45 2
105 B 21 3 02-10-2019 09:04:06 3
106 D 24 3 02-10-2019 09:04:12 4
107 A 24 3 02-10-2019 09:04:14 4
108 E 32 2 02-10-2019 09:04:20 4
109 A 10 1 02-10-2019 09:04:22 5
110 B 10 1 02-10-2019 10:40:49 6
First time bin: '01-10-2019 00:20:21' to '01-10-2019 00:20:30',
Next time bin: '01-10-2019 00:20:31' to '01-10-2019 00:20:40',
Next time bin: '01-10-2019 00:20:41' to '01-10-2019 00:20:50',
Next time bin: '01-10-2019 00:20:51' to '01-10-2019 00:21:00',
Next time bin: '01-10-2019 00:21:01' to '01-10-2019 00:21:10'
and so on.. Based on these time bins, 'Group' is assigned for each row.
It is not mandatory to have consecutive group numbers (if a time bin is not present, it's ok to skip that group number).
I have generated this using a for loop, but it takes a lot of time if the data is spread across months.
Please let me know if this can be done as a pandas operation using a single line of code. Thanks.
Here is an example without a loop. The main approach is to round the seconds down to specific ranges and use ngroup().
02-10-2019 09:04:12 -> 02-10-2019 09:04:11
02-10-2019 09:04:14 -> 02-10-2019 09:04:11
02-10-2019 09:04:20 -> 02-10-2019 09:04:11
02-10-2019 09:04:21 -> 02-10-2019 09:04:21
02-10-2019 09:04:25 -> 02-10-2019 09:04:21
...
I use a new temporary column to find some specific range.
import pandas as pd

df = pd.DataFrame.from_dict({
    'Name': ('A', 'A', 'B', 'C', 'B', 'D', 'A', 'E', 'A', 'B'),
    'Val1': (1, 2, 1, 2, 3, 3, 3, 2, 1, 1),
    'Timestamp': (
        '2019-01-10 00:20:21',
        '2019-01-10 00:20:21',
        '2019-01-10 00:20:26',
        '2019-01-10 14:40:45',
        '2019-02-10 09:04:06',
        '2019-02-10 09:04:12',
        '2019-02-10 09:04:14',
        '2019-02-10 09:04:20',
        '2019-02-10 09:04:22',
        '2019-02-10 10:40:49',
    )
})

# convert str to Timestamp
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# your specific ranges; customize if you need to
def sec_to_group(x):
    if 0 <= x.second <= 10:
        x = x.replace(second=0)
    elif 11 <= x.second <= 20:
        x = x.replace(second=11)
    elif 21 <= x.second <= 30:
        x = x.replace(second=21)
    elif 31 <= x.second <= 40:
        x = x.replace(second=31)
    elif 41 <= x.second <= 50:
        x = x.replace(second=41)
    elif 51 <= x.second <= 59:
        x = x.replace(second=51)
    return x

# new (temporary) column formated_dt with the rounded-down seconds
df['formated_dt'] = df['Timestamp'].apply(sec_to_group)

# group by the new column, assign group numbers with ngroup(), then drop it
df['Group'] = df.groupby('formated_dt').ngroup()
df.drop(columns=['formated_dt'], inplace=True)
print(df)
Output:
# Name Val1 Timestamp Group
# 0 A 1 2019-01-10 00:20:21 0 <- ngroup() calculates from 0
# 1 A 2 2019-01-10 00:20:21 0
# 2 B 1 2019-01-10 00:20:26 0
# 3 C 2 2019-01-10 14:40:45 1
# 4 B 3 2019-02-10 09:04:06 2
# ....
You can also try pd.Grouper with a freq argument (the successor to the deprecated TimeGrouper) or resample.
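A minimal sketch of a more vectorized variant, assuming plain 10-second bins counted from the earliest timestamp (so the bin edges may differ slightly from the second-based rounding above), could be:

import pandas as pd

df = pd.DataFrame({'Timestamp': pd.to_datetime([
    '2019-01-10 00:20:21', '2019-01-10 00:20:21', '2019-01-10 00:20:26',
    '2019-01-10 14:40:45', '2019-02-10 09:04:06', '2019-02-10 09:04:12',
    '2019-02-10 09:04:14', '2019-02-10 09:04:20', '2019-02-10 09:04:22',
    '2019-02-10 10:40:49'])})

# number of whole 10-second intervals since the earliest timestamp
elapsed = (df['Timestamp'] - df['Timestamp'].min()).dt.total_seconds() // 10

# consecutive group labels starting at 1, ordered by time
df['Group'] = df.groupby(elapsed).ngroup() + 1
print(df)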
Hope this helps.

How to split day, hour, minute and second data in a huge Pandas data frame?

I'm new to Python and I'm working on a project for a Data Science class I'm taking. I have a big csv file (around 190 million lines, approx. 7GB of data) and I need, first, to do some data preparation.
Full disclaimer: data here is from this Kaggle competition.
A picture from Jupyter Notebook with headers follows. Although it reads full_data.head(), I'm using a 100,000-line sample just to test the code.
The most important column is click_time. The format is: dd hh:mm:ss. I want to split this into 4 different columns: day, hour, minute and second. I've reached a solution that works fine with this little file, but it takes too long to run on 10% of the real data, let alone on 100% of it (I haven't even been able to try that, since just reading the full csv is a big problem right now).
Here it is:
# First I need to split the values
click = full_data['click_time']
del full_data['click_time']
click = click.str.replace(' ', ':')
click = click.str.split(':')
# Then I transform everything into integers. The last piece of code
# returns an array of lists, one for each line, and each list has 4
# elements. I couldn't figure out another way of making this conversion
click = click.apply(lambda x: list(map(int, x)))
# Now I transform everything into unidimensional arrays
day = np.zeros(len(click), dtype = 'uint8')
hour = np.zeros(len(click), dtype = 'uint8')
minute = np.zeros(len(click), dtype = 'uint8')
second = np.zeros(len(click), dtype = 'uint8')
for i in range(0, len(click)):
    day[i] = click[i][0]
    hour[i] = click[i][1]
    minute[i] = click[i][2]
    second[i] = click[i][3]
del click
# Transforming everything to a Pandas series
day = pd.Series(day, index = full_data.index, dtype = 'uint8')
hour = pd.Series(hour, index = full_data.index, dtype = 'uint8')
minute = pd.Series(minute, index = full_data.index, dtype = 'uint8')
second = pd.Series(second, index = full_data.index, dtype = 'uint8')
# Adding to data frame
full_data['day'] = day
del day
full_data['hour'] = hour
del hour
full_data['minute'] = minute
del minute
full_data['second'] = second
del second
The result is ok, it's what I want, but there has to be a faster way of doing this:
Any ideas on how to improve this implementation? If one is interested in the dataset, this is from the test_sample.csv: https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
Thanks a lot in advance!!
EDIT 1: Following @COLDSPEED's request, I provide the results of full_data.head().to_dict():
{'app': {0: 12, 1: 25, 2: 12, 3: 13, 4: 12},
 'channel': {0: 497, 1: 259, 2: 212, 3: 477, 4: 178},
 'click_time': {0: '07 09:30:38',
                1: '07 13:40:27',
                2: '07 18:05:24',
                3: '07 04:58:08',
                4: '09 09:00:09'},
 'device': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 'ip': {0: 87540, 1: 105560, 2: 101424, 3: 94584, 4: 68413},
 'is_attributed': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
 'os': {0: 13, 1: 17, 2: 19, 3: 13, 4: 1}}
Convert to timedelta and extract components:
v = df.click_time.str.split()
df['days'] = v.str[0].astype(int)
df[['hours', 'minutes', 'seconds']] = (
    pd.to_timedelta(v.str[-1]).dt.components.iloc[:, 1:4]
)
df
app channel click_time device ip is_attributed os days hours \
0 12 497 07 09:30:38 1 87540 0 13 7 9
1 25 259 07 13:40:27 1 105560 0 17 7 13
2 12 212 07 18:05:24 1 101424 0 19 7 18
3 13 477 07 04:58:08 1 94584 0 13 7 4
4 12 178 09 09:00:09 1 68413 0 1 9 9
minutes seconds
0 30 38
1 40 27
2 5 24
3 58 8
4 0 9
One solution is to first split by whitespace, then convert to datetime objects, then extract components directly.
import pandas as pd

df = pd.DataFrame({'click_time': ['07 09:30:38', '07 13:40:27', '07 18:05:24',
                                  '07 04:58:08', '09 09:00:09', '09 01:22:13',
                                  '09 01:17:58', '07 10:01:53', '08 09:35:17',
                                  '08 12:35:26']})
df[['day', 'time']] = df['click_time'].str.split().apply(pd.Series)
df['datetime'] = pd.to_datetime(df['time'])
df['day'] = df['day'].astype(int)
df['hour'] = df['datetime'].dt.hour
df['minute'] = df['datetime'].dt.minute
df['second'] = df['datetime'].dt.second
df = df.drop(columns=['time', 'datetime'])
Result
click_time day hour minute second
0 07 09:30:38 7 9 30 38
1 07 13:40:27 7 13 40 27
2 07 18:05:24 7 18 5 24
3 07 04:58:08 7 4 58 8
4 09 09:00:09 9 9 0 9
5 09 01:22:13 9 1 22 13
6 09 01:17:58 9 1 17 58
7 07 10:01:53 7 10 1 53
8 08 09:35:17 8 9 35 17
9 08 12:35:26 8 12 35 26
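For a file in the 190-million-row range, a plain string-slicing sketch (my assumption being that the dd hh:mm:ss format is strictly fixed-width) may also be worth benchmarking against the two answers above, since it avoids apply entirely:

import pandas as pd

df = pd.DataFrame({'click_time': ['07 09:30:38', '07 13:40:27', '09 09:00:09']})

# with a fixed-width format, each component sits at a known offset
s = df['click_time']
df['day'] = s.str[0:2].astype('uint8')
df['hour'] = s.str[3:5].astype('uint8')
df['minute'] = s.str[6:8].astype('uint8')
df['second'] = s.str[9:11].astype('uint8')
print(df)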

Removing rows below first line that meets threshold in pandas dataframe

I have a df that looks like:
import pandas as pd
import numpy as np

d = {'Hours': np.arange(12, 97, 12),
     'Average': np.random.random(8),
     'Count': [500, 250, 125, 75, 60, 25, 5, 15]}
df = pd.DataFrame(d)
This df has a decreasing number of cases in each row. After the count drops below a certain threshold, I'd like to drop the remainder, for example after a < 10 case threshold is reached.
Starting:
Average Count Hours
0 0.560671 500 12
1 0.743811 250 24
2 0.953704 125 36
3 0.313850 75 48
4 0.640588 60 60
5 0.591149 25 72
6 0.302894 5 84
7 0.418912 15 96
Finished (everything after row 6 removed):
Average Count Hours
0 0.560671 500 12
1 0.743811 250 24
2 0.953704 125 36
3 0.313850 75 48
4 0.640588 60 60
5 0.591149 25 72
We can use the index generated from the boolean index and slice the df using iloc:
In [58]:
df.iloc[:df[df.Count < 10].index[0]]
Out[58]:
Average Count Hours
0 0.183016 500 12
1 0.046221 250 24
2 0.687945 125 36
3 0.387634 75 48
4 0.167491 60 60
5 0.660325 25 72
Just to break down what is happening here
In [54]:
# use a boolean mask to index into the df
df[df.Count < 10]
Out[54]:
Average Count Hours
6 0.244839 5 84
In [56]:
# we want the index and can subscript the first element using [0]
df[df.Count < 10].index
Out[56]:
Int64Index([6], dtype='int64')
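One caveat worth adding (my note, not part of the original answer): if no row ever drops below the threshold, index[0] raises an IndexError, so a small guard might look like this:

# keep the whole frame if nothing falls below the threshold
below = df[df.Count < 10].index
df_trimmed = df.iloc[:below[0]] if len(below) else df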
