Finding the frequency of items in a cell of a column in pandas - python

I have a DataFrame with almost 500 rows and 3 columns.
One of the columns holds a string of dates; most cells contain unique dates, some cells share a common date, and some cells appear empty.
I'm trying to find the frequency of each date across the cells.
df|Number_of_dates | Date
--|--------------------|---------------------
0 | 0.0 | []
1 | 3.0 | ['2006-01-01' '2006-03-22' '2019-07-29']
2 | 8.0 | ['2006-01-01' '2006-04-13' '2006-07-18' '2006-...
3 | 1.0 | ['2006-07-18']
4 | 1.0 | ['2019-07-29']
5 | 0.0 | []
6 | 397.0 | ['2019-01-02' '2019-01-03' '2019-01-04' '2019-...
Result:
df_1 |Date | Frequency
-----|------------ |---------------------
0 | 2006-01-01 |2
1 | 2006-03-22 |1
2 | 2006-04-13 |1
3 | 2006-07-18 |2
4 | 2019-07-29 |3
It would be very helpful if you could provide some guidance.
Thanks in advance
Additional information: I noticed that each cell holds a string value instead of a list.
Sample DataFrame
d = {"Date":[ "['2005-02-02' '2005-05-04' '2005-08-03' '2005-11-02' '2006-02-01' '2006-05-03']",
"['2006-01-31' '2006-02-01' '2006-03-16'\n '2006-06-13']",
"['2005-10-12' '2005-10-13' '2005-10-14'\n '2005-10-17']",
"[]",
"['2005-07-25' '2005-07-26' '2005-07-27'\n '2005-07-28' '2005-07-29' '2005-08-01' '2005-08-02' '2005-08-03'\n '2005-08-04' '2005-08-05']",
"['2005-03-15' '2005-03-16' '2005-03-17'\n '2005-03-18' '2005-03-21' '2005-03-22' '2005-03-23' '2005-03-24' \n'2005-03-28' '2005-03-29' '2005-03-30' '2005-03-31' '2005-04-01'\n '2005-04-04']",
"['2005-03-16' '2005-03-17' '2005-07-27'\n '2006-06-13']",
"['2005-02-02' '2005-05-04' '2005-03-16' '2005-03-17']",
"[]"
]
}
df = pd.DataFrame(d)

Use DataFrame.explode with GroupBy.size:
#create list from sample data
df['Date'] = df['Date'].str.strip('[]').str.split()
df_1 = df.explode('Date').groupby('Date').size().reset_index(name='Frequency')
print (df_1.head(10))
Date Frequency
0 '2005-02-02' 2
1 '2005-03-15' 1
2 '2005-03-16' 3
3 '2005-03-17' 3
4 '2005-03-18' 1
5 '2005-03-21' 1
6 '2005-03-22' 1
7 '2005-03-23' 1
8 '2005-03-24' 1
9 '2005-03-28' 1
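If the quotes around each date in the result are unwanted, a small variation (a sketch against the same sample df) extracts the ISO dates with a regular expression before counting:
import pandas as pd

# pull out the ISO dates themselves, dropping brackets, quotes and newlines
dates = df['Date'].str.findall(r'\d{4}-\d{2}-\d{2}')

df_1 = (dates.explode()
             .dropna()
             .value_counts()
             .sort_index()
             .rename_axis('Date')
             .reset_index(name='Frequency'))
print(df_1.head(10))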

Related

creating a function that takes in multiple dataframe columns and splits a value across the columns

I have a sales dataframe with customer data and sales teams. I have a target number of calls for each individual customer which I want to split across the sales reps:
cust_id| total_calls_req| group_1_rep| group_2_rep| group_3_rep
34523 | 10 | 230429 | nan | 583985
34583 | 12 | 230429 | 539409 | 583985
34455 | 6 | 135552 | nan | nan
I want to create a function that splits the total_calls_req across each group based on whether or not there is a group_rep assigned.
If cust_id is assigned to 1 rep then the total_calls_req is all assigned to the rep in question
If cust_id is assigned to 2 reps then the total_calls_req is split between the two reps in question.
If cust_id is assigned to 3 reps then the total_calls_req is split randomly between the three reps in question, and the splits need to be whole numbers.
I want the end dataframe to look like this:
cust_id| total_calls_req| group_1_rep| group_2_rep| group_3_rep| group_1_rep_calls| group_2_rep_calls| group_3_rep_calls
34523 | 10 | 230429 | nan | 583985 | 5 | 0 | 5
34583 | 12 | 230429 | 539409 | 583985 | 6 | 3 | 3
34455 | 6 | 135552 | nan | nan | 6 | 0 | 0
Is there a way I can do that through a python function?
You can build a function which returns a Series with three elements according to the number of NaN values. I based this on this answer to get the Series; that answer uses numpy.random.multinomial.
import numpy as np
import pandas as pd

def serie_split(row):
    total_calls_req = row[0]
    groups = row[1:]
    assigned = pd.notna(groups).sum()  # number of reps assigned to this customer
    if assigned == len(groups):
        # every rep is assigned: split the calls randomly across all of them
        s = pd.Series(np.random.multinomial(total_calls_req, [1/len(groups)] * len(groups)))
    else:
        s = pd.Series(groups)
        # split evenly when the total divides cleanly, otherwise split randomly among the assigned reps
        s.loc[s.notna()] = total_calls_req / assigned if (total_calls_req % assigned) == 0 else np.random.multinomial(total_calls_req, [1/assigned] * assigned)
        s.loc[s.isna()] = 0
    return s

def get_rep_calls(df):
    columns = df.filter(like='group_').add_suffix('_calls').columns
    dfg = df[df.columns[1:]]  # only 'total_calls_req', 'group_1_rep', 'group_2_rep' and 'group_3_rep'
    series = [serie_split(row) for row in dfg.to_numpy(dtype='object')]
    for index in range(len(dfg)):
        df.loc[index, columns] = series[index].values

get_rep_calls(df)
print(df)
Output (I have added an example in the last row with total_calls_req = 13):
cust_id| total_calls_req| group_1_rep| group_2_rep| group_3_rep| group_1_rep_calls| group_2_rep_calls| group_3_rep_calls
34523  | 10             | 230429     | NaN        | 583985     | 5                | 0                | 5
34583  | 12             | 230429     | 539409     | 583985     | 4                | 7                | 1
34455  | 6              | 135552     | NaN        | NaN        | 6                | 0                | 0
12345  | 13             | 123456     | NaN        | 583985     | 10               | 0                | 3
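As a quick sanity check on this approach (a sketch, run after get_rep_calls(df)), the generated per-rep calls should add back up to total_calls_req for every customer:
calls_cols = df.filter(like='rep_calls').columns
assert (df[calls_cols].sum(axis=1) == df['total_calls_req']).all()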
You can use this custom split function to split the calls among reps that are assigned to the customer. It uses the columns starting with "group_" to identify the assigned reps and count them. When there are more than two, the numpy.random.multinomial function generates a random split.
import numpy as np
import pandas as pd

def split(s):
    # indicator Series: 1 where a rep is assigned, column names suffixed to '*_rep_calls'
    reps = (~s.filter(like='group_').isna()).astype(int).add_suffix('_calls')
    total = reps.sum()
    if total > 2:  # remove this line and the next for an even split across all reps
        return np.random.multinomial(s['total_calls_req'], [1/total]*total)
    div, mod = divmod(int(s['total_calls_req']), total)
    reps = reps*div  # split evenly
    reps.iloc[np.random.choice(np.flatnonzero(reps), mod, replace=False)] += 1  # allocate the remainder randomly
    return reps

pd.concat([df, df.apply(split, axis=1)], axis=1)
output:
cust_id total_calls_req group_1_rep group_2_rep group_3_rep group_1_rep_calls group_2_rep_calls group_3_rep_calls
0 34523 10 230429 NaN 583985.0 5 0 5
1 34583 12 230429 539409.0 583985.0 4 2 6
2 34455 6 135552 NaN NaN 6 0 0
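For a reproducible allocation, you could seed NumPy's random generator before applying the function (a sketch; the seed value is arbitrary):
np.random.seed(0)  # fix the seed so the random splits are repeatable
result = pd.concat([df, df.apply(split, axis=1)], axis=1)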

How do you identify which IDs have an increasing value over time in another column in a Python dataframe?

Let's say I have a data frame with 3 columns:
| id | value | date |
+====+=======+===========+
| 1 | 50 | 1-Feb-19 |
+----+-------+-----------+
| 1 | 100 | 5-Feb-19 |
+----+-------+-----------+
| 1 | 200 | 6-Jun-19 |
+----+-------+-----------+
| 1 | 500 | 1-Dec-19 |
+----+-------+-----------+
| 2 | 10 | 6-Jul-19 |
+----+-------+-----------+
| 3 | 500 | 1-Mar-19 |
+----+-------+-----------+
| 3 | 200 | 5-Apr-19 |
+----+-------+-----------+
| 3 | 100 | 30-Jun-19 |
+----+-------+-----------+
| 3 | 10 | 25-Dec-19 |
+----+-------+-----------+
ID column contains the ID of a particular person.
Value column contains the value of their transaction.
Date column contains the date of their transaction.
Is there a way in Python to identify ID 1 as the ID with the increasing value of transactions over time?
I'm looking for some way I can extract ID 1 as my desired ID with increasing value of transactions, filter out ID 2 because it doesn't have enough transactions to analyze a trend, and also filter out ID 3 as its trend of transactions is declining over time.
Perhaps group by the id, and check that the sorted values are the same whether sorted by values or by date:
>>> df.groupby('id').apply( lambda x:
... (
... x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value']
... ).all()
... )
id
1 True
2 True
3 False
dtype: bool
EDIT:
To make id=2 not True, we can do this instead:
>>> df.groupby('id').apply( lambda x:
... (
... (x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value'])
... & (len(x) > 1)
... ).all()
... )
id
1 True
2 False
3 False
dtype: bool
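An equivalent check (a sketch, assuming the date column parses with pd.to_datetime) sorts each group by date, requires more than one transaction, and tests whether the values are monotonically increasing:
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%y')

increasing = (
    df.sort_values('date')
      .groupby('id')['value']
      .agg(lambda s: len(s) > 1 and s.is_monotonic_increasing)
)
print(increasing[increasing].index.tolist())  # expected: [1]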
df['new'] = df.groupby(['id'])['value'].transform(lambda x: \
    np.where(x.diff()>0,'increase',
    np.where(x.diff()<0,'decrease','--')))
df = df.groupby('id').new.agg(['last'])
df
Output:
last
id
1 increase
2 --
3 decrease
Only increasing ID:
increasingList = df[(df['last']=='increase')].index.values
print(increasingList)
Result:
[1]
Assuming this won't happen
1 50
1 100
1 50
If so, then:
df['new'] = df.groupby(['id'])['value'].transform(lambda x : \
np.where(x.diff()>0,'increase',
np.where(x.diff()<0,'decrease','--')))
df
Output:
value new
id
1 50 --
1 100 increase
1 200 increase
2 10 --
3 500 --
3 300 decrease
3 100 decrease
Concat strings:
df = df.groupby(['id'])['new'].apply(lambda x: ','.join(x)).reset_index()
df
Intermediate Result:
id new
0 1 --,increase,increase
1 2 --
2 3 --,decrease,decrease
Check whether "decrease" appears in a row, or whether only "--" exists, and drop those rows:
df = df.drop(df[df['new'].str.contains("dec")].index.values)
df = df.drop(df[(df['new']=='--')].index.values)
df
Result:
id new
0 1 --,increase,increase
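A compact variant of the same idea (a sketch, assuming the rows within each id are already ordered by date): keep only ids with more than one row whose consecutive differences are all positive.
increasing_ids = (
    df.groupby('id')['value']
      .agg(lambda s: len(s) > 1 and (s.diff().dropna() > 0).all())
)
print(increasing_ids[increasing_ids].index.tolist())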

In pandas DataFrame, how to add column showing random selection result?

I've seen everywhere how to randomly select DataFrame rows in pandas (with and without numpy). What I haven't found is how to add a column to a DataFrame that indicates whether a row was randomly selected. Specifically, I need to
1) group rows by values in column A
2) randomly select 10 rows in each group without replacement
3) add a column B to indicate whether each row was selected (TRUE/FALSE).
The result should be the original DataFrame (i.e., ungrouped) with an added column of TRUE/FALSE for every row (meaning, within its group, the row was selected during random selection).
I'm using python 3.6.2, pandas 0.20.3, numpy 1.13.1.
Edit in response to comments:
For this small sample of data, let's now say: randomly select 2 rows without replacement per group of ImageType. Yes, the data sample does not have at least 2 of every ImageType; please understand that the small dataset is to keep this post from getting really long.
The data looks like this (there are thousands of rows):
+-----------+---------------------+
| ImageType | FileName |
+-----------+---------------------+
| 9 | PIC_001_01_0_9.JPG |
| 9 | PIC_022_17_0_9.JPG |
| 38 | PIC_100_00_0_38.jpg |
| 9 | PIC_293_12_0_9.JPG |
| 9 | PIC_381_14_0_9.JPG |
| 33 | PIC_001_17_2_33.JPG |
| 9 | PIC_012_07_0_9.JPG |
| 28 | PIC_306_00_0_28.jpg |
| 28 | PIC_178_08_0_28.JPG |
| 26 | PIC_225_11_0_26.JPG |
| 18 | PIC_087_16_0_18.JPG |
| 9 | PIC_089_18_0_9.JPG |
| 19 | PIC_090_18_0_19.JPG |
| 9 | PIC_091_18_0_9.JPG |
| 19 | PIC_092_18_2_19.JPG |
| 23 | PIC_270_14_0_23.JPG |
| 13 | PIC_271_14_0_13.JPG |
+-----------+---------------------+
The code is only a read from .csv, but to recreate the sample data above:
import pandas as pd
df = pd.DataFrame({'ImageType': ['9','9','38','9','9','33','9','28','28','26',
'18','9','19','9','19','23','13'],
'FileName': ['PIC_001_01_0_9.JPG','PIC_022_17_0_9.JPG',
'PIC_100_00_0_38.jpg','PIC_293_12_0_9.JPG',
'PIC_381_14_0_9.JPG','PIC_001_17_2_33.JPG',
'PIC_012_07_0_9.JPG','PIC_306_00_0_28.jpg',
'PIC_178_08_0_28.JPG','PIC_225_11_0_26.JPG',
'PIC_087_16_0_18.JPG','PIC_089_18_0_9.JPG',
'PIC_090_18_0_19.JPG','PIC_091_18_0_9.JPG',
'PIC_092_18_2_19.JPG','PIC_270_14_0_23.JPG',
'PIC_271_14_0_13.JPG']})
# group by ImageType
# select 2 rows randomly in each group, without replacement
# add a column to original DataFrame to indicate selected rows
def get_sample(df, n=2):
    if len(df) <= n:
        df['Sampled'] = True
    else:
        s = df.sample(n=n)
        df['Sampled'] = df.apply(lambda x: x.name in s.index, axis=1)
    return df
grouped = df.groupby('ImageType')
new_df = grouped.apply(get_sample)
print(new_df)
FileName ImageType Sampled
0 PIC_001_01_0_9.JPG 9 False
1 PIC_022_17_0_9.JPG 9 False
2 PIC_100_00_0_38.jpg 38 True
3 PIC_293_12_0_9.JPG 9 True
4 PIC_381_14_0_9.JPG 9 False
5 PIC_001_17_2_33.JPG 33 True
6 PIC_012_07_0_9.JPG 9 False
7 PIC_306_00_0_28.jpg 28 True
8 PIC_178_08_0_28.JPG 28 True
9 PIC_225_11_0_26.JPG 26 True
10 PIC_087_16_0_18.JPG 18 True
11 PIC_089_18_0_9.JPG 9 True
12 PIC_090_18_0_19.JPG 19 True
13 PIC_091_18_0_9.JPG 9 False
14 PIC_092_18_2_19.JPG 19 True
15 PIC_270_14_0_23.JPG 23 True
16 PIC_271_14_0_13.JPG 13 True
If the number of rows in a group is less than the sample size, all of them are marked as sampled.
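If the selection needs to be reproducible, a variation of the same idea (a sketch; the seed value is arbitrary) passes random_state to sample and marks rows via index membership:
def get_sample(df, n=2, seed=42):
    df = df.copy()
    # sample at most len(df) rows so small groups don't raise
    s = df.sample(n=min(n, len(df)), random_state=seed)
    df['Sampled'] = df.index.isin(s.index)
    return df

new_df = df.groupby('ImageType', group_keys=False).apply(get_sample)
print(new_df)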

Python pandas - construct multivariate pivot table to display count of NaNs and non-NaNs

I have a dataset based on different weather stations for several variables (Temperature, Pressure, etc.),
stationID | Time | Temperature | Pressure |...
----------+------+-------------+----------+
123 | 1 | 30 | 1010.5 |
123 | 2 | 31 | 1009.0 |
202 | 1 | 24 | NaN |
202 | 2 | 24.3 | NaN |
202 | 3 | NaN | 1000.3 |
...
and I would like to create a pivot table that would show the number of NaNs and non-NaNs per weather station, such that:
stationID | nanStatus | Temperature | Pressure |...
----------+-----------+-------------+----------+
123 | NaN | 0 | 0 |
| nonNaN | 2 | 2 |
202 | NaN | 1 | 2 |
| nonNaN | 2 | 1 |
...
Below I show what I have done so far, which works (in a cumbersome way) for Temperature. But how can I get the same for both variables, as shown above?
import pandas as pd
import numpy as np
df = pd.DataFrame({'stationID':[123,123,202,202,202], 'Time':[1,2,1,2,3],'Temperature':[30,31,24,24.3,np.nan],'Pressure':[1010.5,1009.0,np.nan,np.nan,1000.3]})
dfnull = df.isnull()
dfnull['stationID'] = df['stationID']
dfnull['tempValue'] = df['Temperature']
dfnull.pivot_table(values=["tempValue"], index=["stationID","Temperature"], aggfunc=len,fill_value=0)
The output is:
----------------------------------
tempValue
stationID | Temperature
123 | False 2
202 | False 2
| True 1
UPDATE: thanks to @root:
In [16]: df.groupby('stationID')[['Temperature','Pressure']].agg([nans, notnans]).astype(int).stack(level=1)
Out[16]:
Temperature Pressure
stationID
123 nans 0 0
notnans 2 2
202 nans 1 2
notnans 2 1
Original answer:
In [12]: %paste
def nans(s):
    return s.isnull().sum()

def notnans(s):
    return s.notnull().sum()
## -- End pasted text --
In [37]: df.groupby('stationID')[['Temperature','Pressure']].agg([nans, notnans]).astype(np.int8)
Out[37]:
Temperature Pressure
nans notnans nans notnans
stationID
123 0 2 0 2
202 1 2 2 1
I'll admit this is not the prettiest solution, but it works. First define two temporary columns TempNaN and PresNaN:
df['TempNaN'] = df['Temperature'].apply(lambda x: 'NaN' if x!=x else 'NonNaN')
df['PresNaN'] = df['Pressure'].apply(lambda x: 'NaN' if x!=x else 'NonNaN')
Then define your results DataFrame using a MultiIndex:
Results = pd.DataFrame(index=pd.MultiIndex.from_tuples(list(zip(*[sorted(list(df['stationID'].unique())*2),['NaN','NonNaN']*df['stationID'].nunique()])),names=['stationID','NaNStatus']))
Store your computations in the results DataFrame:
Results['Temperature'] = df.groupby(['stationID','TempNaN'])['Temperature'].apply(lambda x: x.shape[0])
Results['Pressure'] = df.groupby(['stationID','PresNaN'])['Pressure'].apply(lambda x: x.shape[0])
And fill the blank values with zero:
Results.fillna(value=0,inplace=True)
You can loop over the columns if that is easier. For example:
Results = pd.DataFrame(index=pd.MultiIndex.from_tuples(list(zip(*[sorted(list(df['stationID'].unique())*2),['NaN','NonNaN']*df['stationID'].nunique()])),names=['stationID','NaNStatus']))
for col in ['Temperature','Pressure']:
    df[col + 'NaN'] = df[col].apply(lambda x: 'NaN' if x!=x else 'NonNaN')
    Results[col] = df.groupby(['stationID',col + 'NaN'])[col].apply(lambda x: x.shape[0])
    df.drop([col + 'NaN'],axis=1,inplace=True)
Results.fillna(value=0,inplace=True)
d = {'stationID':[], 'nanStatus':[], 'Temperature':[], 'Pressure':[]}
for station_id, data in df.groupby(['stationID']):
    temp_nans = data.isnull().Temperature.mean()*data.isnull().Temperature.count()  # mean * count = number of NaNs
    pres_nans = data.isnull().Pressure.mean()*data.isnull().Pressure.count()
    d['stationID'].append(station_id)
    d['nanStatus'].append('NaN')
    d['Temperature'].append(temp_nans)
    d['Pressure'].append(pres_nans)
    d['stationID'].append(station_id)
    d['nanStatus'].append('nonNaN')
    d['Temperature'].append(data.isnull().Temperature.count() - temp_nans)
    d['Pressure'].append(data.isnull().Pressure.count() - pres_nans)
df2 = pd.DataFrame.from_dict(d)
print(df2)
The result is:
Pressure Temperature nanStatus stationID
0 0.0 0.0 NaN 123
1 2.0 2.0 nonNaN 123
2 2.0 1.0 NaN 202
3 1.0 2.0 nonNaN 202
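Another route to the same table (a sketch against the sample df above) sums boolean NaN/non-NaN flags per station and stacks the two results:
cols = ['Temperature', 'Pressure']

# count missing and present values per station
nan_counts = df[cols].isna().groupby(df['stationID']).sum().assign(nanStatus='NaN')
notnan_counts = df[cols].notna().groupby(df['stationID']).sum().assign(nanStatus='nonNaN')

result = (pd.concat([nan_counts, notnan_counts])
            .set_index('nanStatus', append=True)
            .sort_index()
            .astype(int))
print(result)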

Use pandas groupby.size() results for an arithmetic operation

I ran into the following problem, which I'm stuck on and unfortunately cannot resolve by myself or with similar questions that I found on Stack Overflow.
To keep it simple, I'll give a short example of my problem:
I have a DataFrame with several columns, one of which indicates the ID of a user. The same user might have several entries in this data frame:
| | userID | col2 | col3 |
+---+-----------+----------------+-------+
| 1 | 1 | a | b |
| 2 | 1 | c | d |
| 3 | 2 | a | a |
| 4 | 3 | d | e |
Something like this. Now I want to know the number of rows that belong to a certain userID. For this I tried df.groupby('userID').size(), whose result I then want to use for another simple calculation, like a division.
But as I try to save the results of the calculation in a separate column, I keep getting NaN values.
Is there a way to solve this so that I get the result of the calculations in a separate column?
Thanks for your help!
edit//
To make clear how my output should look: the upper dataframe is my main data frame, so to speak. Besides this frame I have a second frame that looks like this:
| | userID | value | value/appearances |
+---+-----------+----------------+-------+
| 1 | 1 | 10 | 10 / 2 = 5 |
| 3 | 2 | 20 | 20 / 1 = 20 |
| 4 | 3 | 30 | 30 / 1 = 30 |
So I basically want in the column 'value/appearances' to have the result of the number in the value column divided by the number of appearances of this certain user in the main dataframe. For user with ID=1 this would be 10/2, as this user has a value of 10 and has 2 rows in the main dataframe.
I hope this makes it a bit clearer.
IIUC you want to do the following: groupby on 'userID', call transform on the grouped column, and pass 'size' to identify the method to call:
In [54]:
df['size'] = df.groupby('userID')['userID'].transform('size')
df
Out[54]:
userID col2 col3 size
1 1 a b 2
2 1 c d 2
3 2 a a 1
4 3 d e 1
What you tried:
In [55]:
df.groupby('userID').size()
Out[55]:
userID
1 2
2 1
3 1
dtype: int64
When assigned back to the df, the result aligns with the df index, so it introduces NaN for the last row:
In [57]:
df['size'] = df.groupby('userID').size()
df
Out[57]:
userID col2 col3 size
1 1 a b 2
2 1 c d 1
3 2 a a 1
4 3 d e NaN
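To get the value/appearances column from the question, one option (a sketch; df2 is a hypothetical name for the second frame with userID and value columns) maps the group sizes onto that frame:
counts = df.groupby('userID').size()  # appearances of each user in the main frame
df2['value/appearances'] = df2['value'] / df2['userID'].map(counts)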
