data cleaning a python dataframe

I have a Python dataframe with 1408 rows of data. My goal is to take the largest and smallest numbers within one week, note the weekdays on which they occurred, and compare them with the numbers on those same weekdays in the following week. Essentially, I want to look at ranks 1 and 5 within each week (quintiles, since there are 5 days in a business week) and see how they change from week to week, then build a CDF of the numbers associated with each weekday.
1. To clean the data, I need to remove 18 weeks in total: every week in the dataframe that contains a holiday, plus the entire week following each holiday week.
2. After this, I think I should insert a column that labels every date in the file with its weekday, Monday through Friday (there are 6 years of data). The reason for labeling M-F is so that I can sort the numbers associated with each day of the week in ascending order and query on the day of the week.
Methodological suggestions on either 1. or 2. or both would be immensely appreciated.
Thank you!

#2 seems like it's best tackled with a combination of df.groupby() and apply() on the resulting Groupby object. Perhaps an example is the best way to explain.
Given a dataframe:
In [53]: df
Out[53]:
Value
2012-08-01 61
2012-08-02 52
2012-08-03 89
2012-08-06 44
2012-08-07 35
2012-08-08 98
2012-08-09 64
2012-08-10 48
2012-08-13 100
2012-08-14 95
2012-08-15 14
2012-08-16 55
2012-08-17 58
2012-08-20 11
2012-08-21 28
2012-08-22 95
2012-08-23 18
2012-08-24 81
2012-08-27 27
2012-08-28 81
2012-08-29 28
2012-08-30 16
2012-08-31 50
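In [54]: def rankdays(df):
   .....:     # skip partial weeks (fewer than 5 business days)
   .....:     if len(df) != 5:
   .....:         return pandas.Series(dtype=float)
   .....:     # .values avoids aligning the ranks against the new weekday index
   .....:     return pandas.Series(df.Value.rank().values, index=df.index.weekday)
   .....:
In [52]: df.groupby(lambda x: x.week).apply(rankdays).unstack()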
Out[52]:
0 1 2 3 4
32 2 1 5 4 3
33 5 4 1 2 3
34 1 3 5 2 4
35 2 5 3 1 4
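
For part 1 (dropping holiday weeks plus the week after each), here is a minimal sketch of one possible approach. It is not from the answer above: the data and the single holiday date are made up for illustration, and each week is identified by the Monday that starts it.

```python
import pandas as pd

# Hypothetical data: one value per business day.
idx = pd.bdate_range("2012-08-01", "2012-09-28")
df = pd.DataFrame({"Value": range(len(idx))}, index=idx)

# Hypothetical holiday list (assumption: supply your own dates here).
holidays = pd.to_datetime(["2012-09-03"])

# Identify every row's week by the Monday on which that week starts.
week_start = df.index.normalize() - pd.to_timedelta(df.index.weekday, unit="D")

# Weeks to drop: each holiday week and the week that follows it.
holiday_weeks = holidays - pd.to_timedelta(holidays.weekday, unit="D")
drop_weeks = holiday_weeks.union(holiday_weeks + pd.Timedelta(weeks=1))

clean = df[~week_start.isin(drop_weeks)].copy()

# Part 2: label each remaining date with its weekday name.
clean["weekday"] = clean.index.day_name()
```

The weekday column can then be used for sorting and querying, for example clean[clean["weekday"] == "Monday"].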

Related

create column name repeats for column values when particular columns have duplicate rows

I have a dataframe that I need to reshape (I am not sure whether this involves stacking or pivoting).
Where rows have duplicate values in the columns "Year", "Month" and "Group", I want to spread the remaining columns out side by side, repeating the column names once for each value of Variable.
So if this is the original DF:
Year Month Group Variable feature1 feature2 feature3
2010 6 1 1 12 23 56
2010 6 1 2 34 56 25
The result will be :
Year Month Group Variable1 feature1_1 feature2_1 feature3_1 Variable2 feature1_2 feature2_2 feature3_2
2010 6 1 1 12 23 56 2 34 56 25
I am looking for something along these lines. Any tips or help would be much appreciated.
Thank you,
Izzy

IIUC, you want to convert the data from long to wide format. You can use cumcount to build the additional key and then reshape with unstack. (Note that this is the reverse of wide_to_long.)
df['New']=(df.groupby(['Year','Month','Group']).cumcount()+1).astype(str)
w=df.set_index(['Year','Month','Group','New']).unstack().sort_index(level=1,axis=1)
w.columns=pd.Index(w.columns).str.join('_')
w
Out[217]:
                  Variable_1  feature1_1  feature2_1  feature3_1  Variable_2  feature1_2  feature2_2  feature3_2
Year Month Group
2010 6     1               1          12          23          56           2          34          56          25
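
As a self-contained check, the same reshape can be run on the two sample rows from the question. This is only a sketch: the column flattening uses a plain list comprehension instead of the .str.join call above, and reset_index is added to turn the keys back into ordinary columns.

```python
import pandas as pd

# The two sample rows from the question.
df = pd.DataFrame({
    "Year": [2010, 2010], "Month": [6, 6], "Group": [1, 1],
    "Variable": [1, 2],
    "feature1": [12, 34], "feature2": [23, 56], "feature3": [56, 25],
})

keys = ["Year", "Month", "Group"]

# Number the repeated rows within each (Year, Month, Group) group: '1', '2', ...
df["New"] = (df.groupby(keys).cumcount() + 1).astype(str)

# Move the counter into the columns, ordered by counter first.
w = df.set_index(keys + ["New"]).unstack().sort_index(level=1, axis=1)

# Flatten the two-level column labels: ('feature1', '2') -> 'feature1_2'.
w.columns = ["_".join(col) for col in w.columns]
w = w.reset_index()
```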

Group Daily Data by Week for Python Dataframe

So I have a Python dataframe that is sorted by month and then by day,
In [4]: result_GB_daily_average
Out[4]:
NREL Avert
Month Day
1 1 14.718417 37.250000
2 40.381167 45.250000
3 42.512646 40.666667
4 12.166896 31.583333
5 14.583208 50.416667
6 34.238000 45.333333
7 45.581229 29.125000
8 60.548479 27.916667
9 48.061583 34.041667
10 20.606958 37.583333
11 5.418833 70.833333
12 51.261375 43.208333
13 21.796771 42.541667
14 27.118979 41.958333
15 8.230542 43.625000
16 14.233958 48.708333
17 28.345875 51.125000
18 43.896375 55.500000
19 95.800542 44.500000
20 53.763104 39.958333
21 26.171437 50.958333
22 20.372688 66.916667
23 20.594042 42.541667
24 16.889083 48.083333
25 16.416479 42.125000
26 28.459625 40.125000
27 1.055229 49.833333
28 36.798792 42.791667
29 27.260083 47.041667
30 23.584917 55.750000
This continues on for every month of the year and I would like to be able to sort it by week instead of day, so that it looks something like this:
In [4]: result_GB_week_average
Out[4]:
NREL Avert
Month Week
1 1 Average values from first 7 days
2 Average values from next 7 days
3 Average values from next 7 days
4 Average values from next 7 days
And so forth. What would the easiest way to do this be?

I assume that by weeks you don't mean actual calendar weeks. Here is my proposed solution:
# First add a dummy column
result_GB_daily_average['count'] = 1
# Then take a cumulative sum, which numbers the days 1, 2, 3, ...
# Integer-divide so that days 1-7 fall in week 1, days 8-14 in week 2, and so on
# (note: the weeks run continuously across the whole frame, not per month)
result_GB_daily_average['Week'] = (result_GB_daily_average['count'].cumsum() - 1) // 7 + 1
# Then do the group by and calculate the average
result_GB_week_average = result_GB_daily_average.groupby('Week')[['NREL', 'Avert']].mean()
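
If the week counter should restart within each month, as the (Month, Week) index in the question suggests, one possible variant derives the week number from the Day level instead. This is a sketch on made-up data: the frame `daily` is a hypothetical stand-in for result_GB_daily_average.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for result_GB_daily_average: two 30-day months.
idx = pd.MultiIndex.from_product([[1, 2], range(1, 31)], names=["Month", "Day"])
rng = np.random.default_rng(0)
daily = pd.DataFrame(
    {"NREL": rng.random(60) * 100, "Avert": rng.random(60) * 100}, index=idx
)

month = daily.index.get_level_values("Month")
day = daily.index.get_level_values("Day")

# Days 1-7 -> week 1, days 8-14 -> week 2, ...; days 29-30 form a short week 5.
week = pd.Index((day - 1) // 7 + 1, name="Week")

weekly = daily.groupby([month, week]).mean()
```

Here weekly is indexed by (Month, Week), so weekly.loc[1] returns the weekly averages for January.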

Two DataFrames Random Sample by Day grouping instead of hour

I have two dataframes: one is Price and the other is Volume. They are both hourly and cover the same timeframe (one year).
dfP = pd.DataFrame(np.random.randint(5, 10, (8760,4)), index=pd.date_range('2008-01-01', periods=8760, freq='H'), columns='Col1 Col2 Col3 Col4'.split())
dfV = pd.DataFrame(np.random.randint(50, 100, (8760,4)), index=pd.date_range('2008-01-01', periods=8760, freq='H'), columns='Col1 Col2 Col3 Col4'.split())
Each day is a set, in the sense that its values have to stay together: when a sample is generated, it needs to be a full day (for example, all 24 hours of Feb 2, 2008, in this data set). I would like to generate a 185-day (50%) sample set for dfP and take the volumes from the same days, so I can generate a sum product.
dfProduct = dfP_Sample * dfV_Sample
I am lost on how to achieve this. Any help is appreciated.

It sounds like you're expecting to get the sum of the volumes and prices for each day and then multiply them together?
If that's the case, try the following. If not, please clarify your question.
priceGroup = dfP.groupby(by=dfP.index.date).sum()
volumeGroup = dfV.groupby(by=dfV.index.date).sum()
dfProduct = priceGroup * volumeGroup
If you want to just look at a specific date range, try
import datetime
dfProduct[np.logical_and(dfProduct.index > datetime.date(2006, 8, 9), dfProduct.index < datetime.date(2007, 1, 2))]

First, generate a column holding the day of the year, so that 2008-01-01 is assigned 1 because it is the first day of the year, and so on:
day_order = [date.timetuple().tm_yday for date in dfP.index]
dfP['day_order'] = day_order
Then draw random days from 1 to 365 without replacement. Each number represents a day of the year; for example, 1 indicates 2008-01-01.
random_days = np.random.choice(np.arange(1 , 366) , size = 185 , replace=False)
Then slice the original data frame to keep only the rows whose day_order is in the random sample:
dfP_sample = dfP[dfP.day_order.isin(random_days)]
Then you can merge both frames on the index and do whatever you want with the result:
final = pd.merge(dfP_sample , dfV , left_index=True , right_index=True)
final.head()
Out[47]:
Col1_x Col2_x Col3_x Col4_x day_order Col1_y Col2_y Col3_y Col4_y
2008-01-03 00:00:00 9 6 9 9 3 66 85 62 82
2008-01-03 01:00:00 5 8 9 8 3 54 89 65 98
2008-01-03 02:00:00 7 5 5 9 3 83 58 60 96
2008-01-03 03:00:00 9 5 7 6 3 59 54 67 78
2008-01-03 04:00:00 9 5 8 9 3 92 66 66 55
If you don't want to merge both frames, you can apply the same logic to dfV,
and you will then get samples from both data frames for the same days.
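
Applying the same day-level selection to both frames could look like the sketch below. It is one possible variant rather than the answer's exact method: the days are drawn from the normalized timestamps instead of a day_order column, and the seed and the name sum_product are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same shape as the frames in the question: one year of hourly data.
rng = np.random.default_rng(42)
idx = pd.date_range("2008-01-01", periods=8760, freq=pd.Timedelta(hours=1))
cols = "Col1 Col2 Col3 Col4".split()
dfP = pd.DataFrame(rng.integers(5, 10, (8760, 4)), index=idx, columns=cols)
dfV = pd.DataFrame(rng.integers(50, 100, (8760, 4)), index=idx, columns=cols)

# One entry per calendar day (midnight timestamps), then sample 185 whole days.
days = dfP.index.normalize().unique()
sampled_days = pd.Series(days).sample(n=185, random_state=0)

# Keep every hour that belongs to a sampled day, in both frames.
mask = dfP.index.normalize().isin(sampled_days)
dfP_sample = dfP[mask]
dfV_sample = dfV[mask]

# Element-wise product, then the column-wise sum product.
dfProduct = dfP_sample * dfV_sample
sum_product = dfProduct.sum()
```

Because both samples share the same index, the multiplication aligns hour by hour without a merge.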

Python (Pandas) updating previous x rows within specified condition

I have data about machine failures. The data is in a pandas dataframe with date, id, failure and previous_30_days columns. The previous_30_days column is currently all zeros. My desired outcome is to populate rows in the previous_30_days column with a '1' if they occur within a 30 day time-span before a failure. I am currently able to do this with the following code:
failure_df = df[(df['failure'] == 1)] # create a dataframe of just failures
for index, row in failure_df.iterrows():
    df.loc[(df['date'] >= (row.date - datetime.timedelta(days=30))) &
           (df['date'] <= row.date) & (df['id'] == row.id), 'previous_30_days'] = 1
Note that I also check for the id match, because dates are repeated in the dataframe, so I cannot assume it is simply the previous 30 rows.
My code works, but the problem is that the dataframe is millions of rows, and this code is too slow at the moment.
Is there a more efficient way to achieve the desired outcome? Any thoughts would be very much appreciated.

I'm a little confused about how your code works (or is supposed to work), but this ought to point you in the right direction and can be easily adapted. It will be much faster because it avoids iterrows in favor of vectorized operations (about 7x faster for this small dataframe, and the improvement should be much bigger on your large dataframe).
np.random.seed(123)
df = pd.DataFrame({'date': np.random.choice(pd.date_range('2015-1-1', periods=300), 20),
                   'id': np.random.randint(1, 4, 20)})
df = df.sort_values(['id', 'date'])  # df.sort() was removed in later pandas versions
Now, calculate the days between the current and previous date (by id).
df['since_last'] = df.groupby('id')['date'].diff()
Then create your new column based on the number of days since the previous date.
df['previous_30_days'] = df['since_last'] < datetime.timedelta(days=30)
date id since_last previous_30_days
12 2015-02-17 1 NaT False
6 2015-02-27 1 10 days True
3 2015-03-25 1 26 days True
0 2015-04-09 1 15 days True
10 2015-04-24 1 15 days True
5 2015-05-04 1 10 days True
11 2015-05-07 1 3 days True
8 2015-08-14 1 99 days False
14 2015-02-02 2 NaT False
9 2015-04-07 2 64 days False
19 2015-07-28 2 112 days False
7 2015-08-03 2 6 days True
15 2015-08-13 2 10 days True
1 2015-08-19 2 6 days True
2 2015-01-18 3 NaT False
13 2015-03-15 3 56 days False
18 2015-04-07 3 23 days True
4 2015-04-17 3 10 days True
16 2015-04-22 3 5 days True
17 2015-09-11 3 142 days False
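
To flag rows that fall within the 30 days before a failure, as the question asks, one possible adaptation is to look up the next failure date per id with a backward fill. This is a sketch on made-up data, not the answer's code; the column names follow the question, and the failure row itself is flagged, matching the question's `<= row.date` condition.

```python
import pandas as pd

# Hypothetical data using the question's column names.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-01-01", "2015-01-20", "2015-02-10", "2015-03-30",
        "2015-01-15", "2015-02-01",
    ]),
    "id": [1, 1, 1, 1, 2, 2],
    "failure": [0, 0, 1, 0, 0, 0],
})
df = df.sort_values(["id", "date"])

# Keep the date only on failure rows, NaT elsewhere.
failure_date = df["date"].where(df["failure"] == 1)

# For every row, the date of the next failure at or after it, within the same id.
next_failure = failure_date.groupby(df["id"]).bfill()

# Rows with no later failure stay NaT and compare as False.
days_to_failure = next_failure - df["date"]
df["previous_30_days"] = (days_to_failure <= pd.Timedelta(days=30)).astype(int)
```

Checking only the nearest upcoming failure is enough: a row within 30 days of any later failure is also within 30 days of the next one.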

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer and learning python (+pandas) and hope I can explain this well enough. I have a large time series pd dataframe of over 3 million rows and initially 12 columns spanning a number of years. This covers people taking a ticket from different locations denoted by Id numbers(350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']).mean().reset_index()
However, this does not give the desired result, as the mean values are incorrect.
I hope I have explained this issue clearly. I am looking for the mean per hour per day per Id, as I plan to use clustering to separate my dataset into groups before applying a predictive model to these groups.
I would be grateful for any help and, if possible, an explanation of what I am doing wrong, either in the code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows, contains 350 Id numbers.

Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations (select 'Count' so the datetime column is not summed)
df = df.groupby(['Id', 'date', 'dow', 'hour'])['Count'].sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour'])['Count'].mean()
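
The two-step aggregation (sum per date first, then mean over dates) can be checked against the toy dataset from the question. This sketch rebuilds that dataset by hand and uses the toy column names (Date, Dow, Hour) rather than the helper columns above.

```python
import pandas as pd

# The toy dataset from the question: one row per ticket taken.
rows = (
    [("12/12/2014", 0, 9)] * 5
    + [("19/12/2014", 0, 9)] * 3
    + [("26/12/2014", 0, 10)]
    + [("27/12/2014", 1, 11)] * 4
    + [("04/01/2015", 1, 11)]
)
toy = pd.DataFrame(rows, columns=["Date", "Dow", "Hour"])
toy["Id"] = 1234
toy["Count"] = 1

# Step 1: total tickets per Id, date, day of week and hour.
per_date = toy.groupby(["Id", "Date", "Dow", "Hour"], as_index=False)["Count"].sum()

# Step 2: average those daily totals per Id, day of week and hour.
mean_count = per_date.groupby(["Id", "Dow", "Hour"])["Count"].mean()
```

The result matches the table the question asks for: 4 for (Dow 0, Hour 9), 1 for (Dow 0, Hour 10) and 2.5 for (Dow 1, Hour 11).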

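You can use groupby on the 'Id' column and then resample by hour and sum the counts. (Older pandas versions spelled this with the how='sum' argument to resample; current versions call .sum() on the resampler instead.)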
