Grouping by date and number of unique users for multiple variables - python

I have a dataframe containing tweets. I've got columns with information about the datetime, about a unique user_id and then columns indicating if the tweet belongs to a thematic category. In the end I'd like to visualize it with a line graph.
The data looks as follows:
datetime user_id Meta News & Media Environment ...
0 2019-05-08 07:16:02 21741359 NaN NaN 1.0
1 2019-05-08 07:15:23 2785265103 NaN NaN 1.0
2 2019-05-08 07:14:11 606785697 NaN 1.0 NaN
3 2019-05-08 07:13:42 718989200616529921 1.0 NaN NaN
4 2019-05-08 07:13:27 939207240728350720 1.0 NaN 1.0
... ... ... ... ... ...
So far I've managed to produce one just summing each theme per day with the following code:
monthly_trends = tweets_df.groupby(pd.Grouper(key='datetime', freq='D'))[list(issues.keys())].sum().fillna(0)
which gives me:
Meta News & Media Environment ...
datetime
2019-05-07 586.0 25.0 30.0
2019-05-08 505.0 16.0 70.0
2019-05-09 450.0 12.0 50.0
2019-05-10 339.0 8.0 90.0
2019-05-11 254.0 5.0 10.0
I plot this with:
monthly_trends.plot(kind='line', figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Date', fontsize=20)
plt.title('Issue activity during the election period', size = 30)
plt.show()
Which gives me a nice graph. But since one user may just be spamming one theme, I'd like to get a count of the frequency of unique users per theme per day. I've tried using additional groupby's but only got errors.

For pandas' DataFrame.plot across multiple series you need data in wide format, with a separate column per series. For the unique user_id calculation, however, you need data in long format for the aggregation. Therefore, consider melt and groupby, then pivot back to wide for plotting. Had you not needed a plot, you could stop at the long-format aggregation.
### RESHAPE LONG AND AGGREGATE
long_df = (tweets_df.melt(id_vars=['datetime', 'user_id'],
                          var_name='Issue', value_name='Count')
                    .query("Count >= 1")
                    .groupby([pd.Grouper(key='datetime', freq='D'), 'Issue'])['user_id']
                    .nunique()
                    .reset_index()
          )
### RESHAPE WIDE AND PLOT
(long_df.pivot(index='datetime', columns='Issue', values='user_id')
        .plot(kind='line', title='Unique Users by Day and Tweet Issue')
)
plt.show()
plt.clf()
plt.close()

Stack all issues, group by issue and day, and count the unique user ids:
df.columns.names = ['issue']
df_users = (df.set_index(['datetime', 'user_id'])[issues]
              .stack()
              .reset_index()
              .groupby([pd.Grouper(key='datetime', freq='D'), 'issue'])
              .apply(lambda x: len(x.user_id.unique()))
              .rename('n_unique_users')
              .reset_index())
print(df_users)
datetime issue n_unique_users
0 2019-05-08 Environment 3
1 2019-05-08 Meta 2
2 2019-05-08 News & Media 1
Then you can reshape as required for plotting:
df_users.pivot_table(index='datetime', columns='issue', values='n_unique_users', aggfunc=sum)
issue Environment Meta News & Media
datetime
2019-05-08 3 2 1
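To get the same kind of line chart as in the question from that table, a minimal sketch (assuming matplotlib.pyplot is imported as plt, as in the original plotting code) could be:
import matplotlib.pyplot as plt

# wide format: one column per issue, one row per day
wide = df_users.pivot_table(index='datetime', columns='issue',
                            values='n_unique_users', aggfunc='sum')

wide.plot(kind='line', figsize=(20, 10), linewidth=5, fontsize=20)
plt.xlabel('Date', fontsize=20)
plt.title('Unique users per issue per day', size=30)
plt.show()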

Related

how to clean and rearrange a dataframe with pairs of date and price columns into a df with common date index?

I have a dataframe of price data that looks like the following: (with more than 10,000 columns)
        Unamed: 0   01973JAC3 corp   Unamed: 2   019754AA8 corp   Unamed: 4   01265RTJ7 corp   Unamed: 6   01988PAD0 corp   Unamed: 8   019736AB3 corp
1       2004-04-13  101.1            2008-06-16  99.1             2010-06-14  110.0            2008-06-18  102.1            NaT         NaN
2       2004-04-14  101.2            2008-06-17  100.4            2010-07-05  110.3            2008-06-19  102.6            NaT         NaN
3       2004-04-15  101.6            2008-06-18  100.4            2010-07-12  109.6            2008-06-20  102.5            NaT         NaN
4       2004-04-16  102.8            2008-06-19  100.9            2010-07-19  110.1            2008-06-21  102.6            NaT         NaN
5       2004-04-19  103.0            2008-06-20  101.3            2010-08-16  110.3            2008-06-22  102.8            NaT         NaN
...     ...         ...              ...         ...              ...         ...              ...         ...              NaT         NaN
3431    NaT         NaN              2021-12-30  119.2            NaT         NaN              NaT         NaN              NaT         NaN
3432    NaT         NaN              2021-12-31  119.4            NaT         NaN              NaT         NaN              NaT         NaN
(Those are 9-digit CUSIPs in the header. So every two columns represent date and closed price for a security.)
I would like to
1) find and get rid of empty pairs of date and price like "Unamed: 8" and "019736AB3 corp",
2) then rearrange the dataframe into a panel of monthly close prices as follows:
Date        01973JAC3  019754AA8  01265RTJ7  01988PAD0
2004-04-30  102.1      NaN        NaN        NaN
2004-05-31  101.2      NaN        NaN        NaN
...         ...        ...        ...        ...
2021-12-30  NaN        119.2      NaN        NaN
2021-12-31  NaN        119.4      NaN        NaN
Edit:
I want to clarify my question.
My dataframe has more than 10,000 columns, which makes it impossible to drop columns by name or rename them one by one. The pairs of date and price start and end at different times and are of different lengths (and of different frequencies). I'm looking for an efficient way to arrange them into a less messy form. Thanks.
Here is a sample of 30 columns. https://github.com/txd2x/datastore file name: sample-question2022-01.xlsx
I figured it out: stacking and then reshaping. Thanks for the help.
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['Date', 'ClosedPrice', 'CUSIP'])
# use a for loop to stack all the column pairs
for i in np.arange(len(price.columns) / 2):
    temp = pd.DataFrame(columns=['Date', 'ClosedPrice', 'CUSIP'])
    temp['Date'] = price.iloc[0:np.shape(price)[0] - 1, int(2 * i)]
    temp['ClosedPrice'] = price.iloc[0:np.shape(price)[0] - 1, int(2 * i + 1)]
    temp['CUSIP'] = price.columns[int(i * 2 + 1)][:9]
    df = df.append(temp)

df = df.dropna(axis=0, how='any')                                    # drop NaN rows
df = df.pivot(index='Date', columns='CUSIP', values='ClosedPrice')   # Date as index, CUSIPs as column headers
df_monthly = df.resample('M').last()                                 # last price of each month
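Note that DataFrame.append is deprecated and was removed in pandas 2.0, so on newer versions the same stacking loop can collect the pieces in a list and concatenate once. A sketch, assuming the same price dataframe as above:
import pandas as pd

# one long (Date, ClosedPrice, CUSIP) frame per date/price column pair
parts = []
for i in range(0, len(price.columns), 2):
    pair = price.iloc[:, [i, i + 1]].copy()
    pair.columns = ['Date', 'ClosedPrice']
    pair['CUSIP'] = price.columns[i + 1][:9]
    parts.append(pair.dropna())

long_df = pd.concat(parts, ignore_index=True)
# pivot requires unique (Date, CUSIP) pairs, as in the code above
panel = long_df.pivot(index='Date', columns='CUSIP', values='ClosedPrice')
df_monthly = panel.resample('M').last()  # last observed price in each month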
If you want to get rid of unneeded columns, run:
df.drop("name_of_column", axis=1, inplace=True)
If you want to drop empty rows, use:
df.drop(df.index[row_number], inplace=True)
If you want to rearrange the data using the timestamp/date, you need to convert it to a datetime object and then make it the index:
import datetime
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date')
You probably also want to rename the columns before doing any of the above: df.rename(columns={'first_column': 'first', 'second_column': 'second'}, inplace=True)
Update 01:
If you want to keep just some of those 10,000 columns, say 10 or 7 of them, use df = df[["first_column", "second_column", ...]]
If you want to get rid of all empty columns, use df.dropna(axis=1, how='all'). The "how" keyword has two values: "all" drops a row or column only if it is entirely NaN, "any" drops it if it has at least one NaN.
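As a tiny illustration of how how='all' behaves on an empty date/price pair like "Unamed: 8"/"019736AB3 corp" (a hypothetical toy frame, not the real data):
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Unamed: 8': [pd.NaT, pd.NaT],
                    '019736AB3 corp': [np.nan, np.nan],
                    '019754AA8 corp': [99.1, 100.4]})

# the two all-empty columns are dropped, the one with data is kept
print(toy.dropna(axis=1, how='all').columns.tolist())
# ['019754AA8 corp']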
Update 02:
Now if you have got a lot of date columns and you just want to keep one of them (supposing that you have chosen a date column with no NaN values), use the following code:
columns = df.columns.tolist()
for column in columns:
    try:
        if df[column].dtypes == 'object':
            df[column] = pd.to_datetime(df[column])
        if (df[column].dtypes == 'datetime64[ns]') & (column != 'Date'):
            df.drop(column, axis=1, inplace=True)
    except ValueError:
        pass
Rearrange the dataframe by month:
import datetime
df.Date=pd.to_datetime(df.Date)
df['Month']=df.Date.dt.month
df['Year']=df.Date.dt.year
df = df.set_index('Month')
df.groupby(["Year","Month"]).mean()
Update 03:
To combine all date columns while preserving data, use the following code:
import pandas as pd
import numpy as np

df = pd.read_excel('sample_question2022-01.xlsx')
columns = df.columns.tolist()
for column in columns:
    if df[column].isnull().sum() > 2300:
        df.drop(column, axis=1, inplace=True)

columns = df.columns.tolist()
import itertools
count_date = itertools.count(1)
count_price = itertools.count(1)
for column in columns:
    if df[column].dtypes == 'datetime64[ns]':
        df.rename(columns={column: f'date{next(count_date)}'}, inplace=True)
    else:
        df.rename(columns={column: f'Price{next(count_price)}'}, inplace=True)

columns = df.columns.tolist()
merged = df[[columns[0], columns[1]]].set_index('date1')
k = 2
for i in range(2, len(columns) - 1, 2):
    merged = pd.merge(merged, df[[columns[i], columns[i + 1]]].set_index(f'date{k}'),
                      how='outer', left_index=True, right_index=True)
    k += 1
The only problem left is that it throws a MemoryError:
MemoryError: Unable to allocate 27.4 GiB for an array with shape (3677415706,) and data type int64

Error while custom-period resampling using resample('W').sum() in python

I have the data frame (frame_combined_DF), which looks like this. I need to do custom resampling based on the Time_W weeks provided for each SKU.
frame_combined_DF
SKU Qty Time Time_W
WY
2011-10-17 ABC 12.0 11.0 2
2012-01-16 ABC 20.0 11.0 2
2013-04-08 ABC 6.0 11.0 2
2013-12-02 ABC 2.0 11.0 2
2014-10-27 XYZ 1.0 21.0 3
Below is my code
for i in ids:
    subset = frame_combined_DF.loc[frame_combined_DF.SKU==i]
    subset.index = subset.WY
    subset.sort_index(inplace=True)
    period = subset.Time_W.unique().astype('int64')[0]
    per = str(period) + 'W'
    df = subset.Qty.resample(per).sum()
    new_df = {'WY': df.index, 'Qty': df.values, 'SKU': i}
    newdf = pd.DataFrame(new_df)
    new_series = new_series.append(newdf)
I am getting following error while running this code
ValueError: Offset <0 * Weeks: weekday=6> did not increment date
Expected output is as under. The example below is only for one SKU. This SKU needs to be resampled at a frequency of 2 weeks, whereas SKU XYZ is to be resampled every three weeks.
WY Qty SKU
2011-10-17 12.0 ABC
2011-10-31 0.0 ABC
2011-11-14 0.0 ABC
2011-11-28 0.0 ABC
2011-12-12 0.0 ABC
.........................
.........................
2012-01-09 20.0 ABC
2012-01-23 0.0 ABC
..........................
From your sample data I see that WY is the index column.
But check whether this column is of datetime type (not string).
If it is not, run frame_combined_DF.index = pd.to_datetime(frame_combined_DF.index).
Another point to note is that newdf is a DataFrame, not a Series,
so you should append it to a DataFrame.
The third remark is that subset.index = subset.WY is not needed, because
WY is already the index.
And the last thing: Your sample did not define new_series (in my solution
I changed it to result).
So change your code to:
result = pd.DataFrame()
for i in frame_combined_DF.SKU.unique():
    subset = frame_combined_DF.loc[frame_combined_DF.SKU==i]
    subset.sort_index(inplace=True)
    period = subset.Time_W.unique().astype('int64')[0]
    per = str(period) + 'W'
    df = subset.Qty.resample(per).sum()
    new_df = {'WY': df.index, 'Qty': df.values, 'SKU': i}
    newdf = pd.DataFrame(new_df)
    result = result.append(newdf, ignore_index=True)
and it should run, at least on my computer it gives no error.
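One more remark, stated as an assumption about your pandas version: DataFrame.append was deprecated and later removed, so on pandas 2.x the same loop can collect the per-SKU frames in a list and concatenate once. A sketch:
import pandas as pd

pieces = []
for i in frame_combined_DF.SKU.unique():
    subset = frame_combined_DF.loc[frame_combined_DF.SKU == i].sort_index()
    per = str(subset.Time_W.unique().astype('int64')[0]) + 'W'
    resampled = subset.Qty.resample(per).sum()
    pieces.append(pd.DataFrame({'WY': resampled.index,
                                'Qty': resampled.values,
                                'SKU': i}))

result = pd.concat(pieces, ignore_index=True)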

Plotting counts of a dataframe grouped by timestamp

So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code to get the graph looking the way I wanted. But it seems to me that I'm not doing this the "correct" way, since this must be a fairly common usecase.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
cnt = data.groupby([(data.Timestamp.dt.year), (data.Timestamp.dt.month)]).count()
index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))
cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)
for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]
cnt_new.plot(kind = 'bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the results from an SQL query. Actual data is my company's so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for the resample command. I just need to run it on the Timestamp series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
> data.index = data.Timestamp
> data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month"
Using pd.resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non observed months, we need to create a baseline calendar
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
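To turn that into the bar chart the question asks for, one option (a sketch following the same approach) is to fill the unobserved months with zero before plotting:
import matplotlib.pyplot as plt

df_a['count'] = df_a['count'].fillna(0)
df_a['count'].plot(kind='bar')
plt.show()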

Trouble with for loop in a function and combing multiple series output

I'm new to python and am struggling to figure something out. I'm doing some data analysis on an invoice database in pandas with columns of $ amounts, credits, date, and a unique company ID for each package bought.
I want to run every unique company ID through a function that calculates the average spend rate of these credits based on the difference between package purchase dates. I have the basics figured out in my function, and it returns a series indexed to the original dataframe with the values of the average amount of credits spent each day between packages. However, I only have it working with one company ID at a time, and I don't know how to combine all of these different series for each company ID so that I can correctly add a new column onto my dataframe with this average credit spend value for each package. Here's my code so far:
def creditspend(mylist=[]):
    for i in mylist:
        a = df.loc[df['CompanyId'] == i]
        a = a.sort_values(by=['Date'], ascending=False)
        days = a.Date.diff().map(lambda x: abs(x.days))
        spend = a['Credits'] / days
        print(spend)
If I call
creditspend(mylist=[8, 15])
(with multiple inputs) it obviously does not work. What do I need to do to complete this function?
Thanks in advance.
apply() is a very useful method in pandas that applies a function to every row or column of a DataFrame.
So, if your DataFrame is df:
def creditspend(row):
    # some calculation code here
    return spend

df['spend_rate'] = df.apply(creditspend, axis=1)
(By default apply() works on each column; pass axis=1, as above, to apply the function to each row.)
Consider a groupby for a CompanyID aggregation. Below demonstrates with random data:
import numpy as np
import pandas as pd

np.random.seed(7182018)
df = pd.DataFrame({'CompanyID': np.random.choice(['julia', 'pandas', 'r', 'sas', 'stata', 'spss'], 50),
                   'Date': np.random.choice(pd.Series(pd.date_range('2018-01-01', freq='D', periods=180)), 50),
                   'Credits': np.random.uniform(0, 1000, 50)
                  }, columns=['Date', 'CompanyID', 'Credits'])

# SORT ONCE OUTSIDE OF PROCESSING
df = df.sort_values(by=['CompanyID', 'Date'], ascending=[True, False]).reset_index(drop=True)

def creditspend(g):
    g['days'] = g.Date.diff().map(lambda x: abs(x.days))
    g['spend'] = g['Credits'] / g['days']
    return g

grp_df = df.groupby('CompanyID').apply(creditspend)
Output
print(grp_df.head(20))
# Date CompanyID Credits days spend
# 0 2018-06-20 julia 280.522287 NaN NaN
# 1 2018-06-12 julia 985.009523 8.0 123.126190
# 2 2018-05-17 julia 892.308179 26.0 34.319545
# 3 2018-05-03 julia 97.410360 14.0 6.957883
# 4 2018-03-26 julia 480.206077 38.0 12.637002
# 5 2018-03-07 julia 78.892365 19.0 4.152230
# 6 2018-03-03 julia 878.671506 4.0 219.667877
# 7 2018-02-25 julia 905.172807 6.0 150.862135
# 8 2018-02-19 julia 970.016418 6.0 161.669403
# 9 2018-02-03 julia 669.073067 16.0 41.817067
# 10 2018-01-23 julia 636.926865 11.0 57.902442
# 11 2018-01-11 julia 790.107486 12.0 65.842291
# 12 2018-06-16 pandas 639.180696 NaN NaN
# 13 2018-05-21 pandas 565.432415 26.0 21.747401
# 14 2018-04-22 pandas 145.232115 29.0 5.008004
# 15 2018-04-13 pandas 379.964557 9.0 42.218284
# 16 2018-04-12 pandas 538.168690 1.0 538.168690
# 17 2018-03-20 pandas 783.572993 23.0 34.068391
# 18 2018-03-14 pandas 618.354489 6.0 103.059081
# 19 2018-02-10 pandas 859.278127 32.0 26.852441
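As a side note, the per-company day differences can also be computed without a custom function, using groupby together with diff. A sketch on the sorted frame above:
# days between consecutive purchases within each CompanyID
# (frame is sorted by CompanyID and descending Date, so diff() is negative and abs() is applied)
df['days'] = df.groupby('CompanyID')['Date'].diff().abs().dt.days
df['spend'] = df['Credits'] / df['days']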

resample Pandas dataframe and merge strings in column

I want to resample a pandas dataframe and apply different functions to different columns. The problem is that I cannot properly process a column with strings. I would like to apply a function that merges the string with a delimiter such as " - ". This is a data example:
import pandas as pd
import numpy as np

idx = pd.date_range('2017-01-31', '2017-02-03')
data = list([[1, 10, "ok"], [2, 20, "merge"], [3, 30, "us"]])
dates = pd.DatetimeIndex(['2017-01-31', '2017-02-03', '2017-02-03'])
d = pd.DataFrame(data, index=dates, columns=list('ABC'))
A B C
2017-01-31 1 10 ok
2017-02-03 2 20 merge
2017-02-03 3 30 us
Resampling the numeric columns A and B with a sum and mean aggregator works. Column C, however, only kind of works with sum (it ends up in the second position, which might mean that something fails).
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': sum})
A C B
2017-01-31 1.0 a 10.0
2017-02-01 NaN 0 NaN
2017-02-02 NaN 0 NaN
2017-02-03 5.0 merge us 25.0
I would like to get this:
...
2017-02-03 5.0 merge - us 25.0
I tried using lambda in different ways but without success (not shown).
If I may ask a second related question: I can do some post processing for this, but how to fill missing cells in different columns with zeros or ""?
Your agg function for column 'C' should be a join
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
A B C
2017-01-31 1.0 10.0 ok
2017-02-01 NaN NaN
2017-02-02 NaN NaN
2017-02-03 5.0 25.0 merge - us
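For the follow-up about filling the missing cells: the join already yields an empty string for days with no rows, so only the numeric columns need filling. A sketch, reusing d and np from the question:
res = d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
res = res.fillna({'A': 0, 'B': 0})  # numeric NaN -> 0; column C is already ''
print(res)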
