I am new to Pandas.
I am accessing the date column, which is in this format:
Restaurent ISSDTM
CREAMERY INC 4/5/2013 12:47
CREAMERY INC 4/5/2013 12:47
SANDRA 3/5/2009 11:23
SANDRA 8/26/2009 13:11
print(df['ISSDTTM'].dtype) --> object
I want to do a count plot for this as per the year.
I tried using
`df1=df['ISSDTTM'].apply(lambda x:x.split('/'))`
to access the date, but I am unable to split on the space in between. Also,
`df1=df['ISSDTTM'].apply(lambda x:x.split(['/',' ']))`
didn't work.
I also tried to access the last 4 characters using
`df2=df['ISSDTTM'].apply(lambda x:x[-1:-4])`
Is there an approach to split this type of date format? Should I use dt.strftime?
Yes, you were on the right track with dt. Coerce to datetime and use dt.year.
pd.to_datetime(df.ISSDTM, errors='coerce').dt.year
0 2013
1 2013
2 2009
3 2009
Name: ISSDTM, dtype: int64
You can use DataFrame.plot.bar, or seaborn.countplot to generate a count-plot.
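A minimal sketch of the whole pipeline, with the column name and sample values taken from the question (the data here is just the four rows shown above):

```python
import pandas as pd

# Sample data mirroring the question's ISSDTM column (stored as strings)
df = pd.DataFrame({
    "ISSDTM": ["4/5/2013 12:47", "4/5/2013 12:47",
               "3/5/2009 11:23", "8/26/2009 13:11"]
})

# Coerce to datetime; unparseable values become NaT instead of raising
years = pd.to_datetime(df["ISSDTM"], errors="coerce").dt.year

# Counts per year, ready for a bar plot
counts = years.value_counts().sort_index()
print(counts)
# counts.plot.bar()              # pandas bar plot
# seaborn.countplot(x=years)     # or a seaborn count plot
```

Splitting the strings by hand is never needed once the column is a real datetime.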
I am working on a geospatial project where I need to do some calculations between groups of data within a DataFrame. The data I am using spans several different years, and each year has a numerical ID specific to the Local Authority District (LAD) code.
I need to be able to calculate the mean of a group of years within that data set relative to the LAD code.
LAC LAN JAN FEB MAR APR MAY JUN ID
K04000001 ENGLAND AND WALES 56597 43555 49641 88049 52315 42577 5
E92000001 ENGLAND 53045 40806 46508 83504 49413 39885 5
I can use groupby to calculate the mean based on a LAC, but what I can't do is calculate the mean grouped by LAC for IDs 1:3, for example.
What is more efficient: splitting into separate DataFrames stored in a dict, for example, or keeping everything in one DataFrame and using an ID?
df.groupby('LAC').mean()
I come from a MATLAB background, so I am just getting the hang of the best way to do things.
Secondly, once these operations are complete, I would like to do the following:
(mean of IDs 1-5) - (mean of ID 6), using LAC as the key.
Sorry if I haven't explained this very well!
Edit: Expected output.
To be able to average a group of rows by specific ID for a given value of LAC.
For example:
Average monthly values for E92000001 rows with ID 3
LAC JAN FEB MAR APR MAY JUN ID
K04000001 56706 43653 49723 88153 52374 42624 5
K04000001 56597 43555 49641 88049 52315 42577 5
E92000001 49186 36947 42649 79645 45554 36026 5
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 68715 56476 62178 99174 65083 55555 4
E92000001 41075 28836 34538 71534 37443 27915 3
E92000001 54595 42356 48058 85054 50963 41435 1
Rows to be averaged:
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 41075 28836 34538 71534 37443 27915 3
Result
E92000001 47060 34821 40523 77519 43428 33900 3
edit: corrected error.
To match the update in your question: this will give you a DataFrame with only one row for each ID-LAC combination, holding the average of all the rows that had that index.
df.groupby(['ID', 'LAC']).mean()
I would start by setting the ID and LAC as the index (note that inplace=True returns None, so the two calls can't be chained):
df = df.set_index(['ID', 'LAC']).sort_index()
Now you can group by an index level and get the mean for every month, or even each row's running average within its group:
expanding_mean = df.groupby(level='LAC').cumsum() / (df.groupby(level='LAC').cumcount() + 1)
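A sketch of the ID-range averaging and the difference you describe. The column names come from your sample; for brevity only the JAN column is reproduced, and the ID range 1-5 vs. ID 6 follows your "(mean of IDs 1-5) - (mean of ID 6)" example:

```python
import pandas as pd

# Toy data with the question's columns (values are illustrative only)
df = pd.DataFrame({
    "LAC": ["E92000001"] * 4 + ["K04000001"] * 2,
    "JAN": [53045, 41075, 54595, 68715, 56597, 56706],
    "ID":  [3, 3, 1, 6, 5, 5],
})

# Mean per LAC restricted to a range of IDs (here 1 through 5)
mean_1_to_5 = df[df["ID"].between(1, 5)].groupby("LAC")["JAN"].mean()

# Mean per LAC for ID 6 only
mean_6 = df[df["ID"] == 6].groupby("LAC")["JAN"].mean()

# (mean of IDs 1-5) - (mean of ID 6); pandas aligns the two on LAC
diff = mean_1_to_5 - mean_6
```

Keeping everything in one DataFrame and filtering on the ID column like this is usually simpler than maintaining a dict of per-ID frames; note that LACs missing from either side come out as NaN after the subtraction.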
I have a question. I am dealing with a DataFrame with a DatetimeIndex in pandas. I want to perform a count on a particular column, grouped by month.
For example:
df.groupby(df.index.month)["count_interest"].count()
Assuming that I am analyzing data starting from December 2019, I get a result like this:
date
1 246
2 360
3 27
12 170
In reality, December 2019 is supposed to come first. What can I do? When I plot the grouped frame, December 2019 shows up last, which is chronologically incorrect.
You can try reindex:
df.groupby(df.index.month)["count_interest"].count().reindex([12,1,2,3])
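A small end-to-end sketch with made-up dates spanning December 2019 through March 2020:

```python
import pandas as pd

# Toy frame indexed by date: December 2019 through March 2020
idx = pd.to_datetime(["2019-12-05", "2019-12-20", "2020-01-10",
                      "2020-02-14", "2020-03-01"])
df = pd.DataFrame({"count_interest": 1}, index=idx)

counts = df.groupby(df.index.month)["count_interest"].count()
# Month numbers sort as integers: 1, 2, 3, 12
ordered = counts.reindex([12, 1, 2, 3])  # December first
```

If the data crosses a year boundary like this, another option is to group by `df.index.to_period('M')` instead of the bare month number: the year stays attached and the result already sorts chronologically.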
I want to add the number of months in columny to the date in columnx, producing the results column:
columnx    columny  results
2019-02-15 2        2019-04-15
2019-05-08 1        2019-06-08
It should not change the day of the month: the 15th should stay the 15th and the 8th the 8th. Shifts like the 31st to the 30th (and vice versa) are okay. Most importantly, I don't want to use .apply(). Thanks!
This should solve it.
Please check that you have datetime format in your columnx, and then run the line below.
EDIT
df["results"] = df["columnx"] + df['columny'].astype('timedelta64[M]')
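One caveat: `astype('timedelta64[M]')` adds an average month length, which can shift the day of the month. A possible day-preserving alternative, sketched with the question's column names using exact Period arithmetic (no .apply() involved):

```python
import pandas as pd

df = pd.DataFrame({
    "columnx": pd.to_datetime(["2019-02-15", "2019-05-08"]),
    "columny": [2, 1],
})

# Step the month exactly with Period arithmetic, then put the
# original day of the month back onto the first-of-month timestamp
shifted = (df["columnx"].dt.to_period("M") + df["columny"]).dt.to_timestamp()
df["results"] = shifted + pd.to_timedelta(df["columnx"].dt.day - 1, unit="D")
```

Note that with this sketch a day of month larger than the target month's length (e.g. the 31st into a 30-day month) spills into the following month rather than clamping to the 30th.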
Consider the DataFrame data:
one two three four
Ohio 2013-01-01 1 2 3
Colorado 2014-01-05 5 6 7
Utah 2015-05-06 9 10 11
New York 2016-10-11 13 14 15
I'd like to extract the row using only the criterion that the year is a given year, e.g., something like data['one'][:][0:4] == '2013'. But the command data['one'][:][0:4] returns
Ohio 2013-01-01
Colorado 2014-01-05
Utah 2015-05-06
New York 2016-10-11
Name: one, dtype: object
I thought this was the right thing to do because the command data['one'][0][0:4] returns
'2013'
Why the difference, and what's the correct way to do this?
Since column 'one' consists of dates, it'd be best to have pandas recognize it as such, instead of recognizing it as strings. You can use pd.to_datetime to do this:
df['one'] = pd.to_datetime(df['one'])
This allows you to filter on date properties without needing to worry about slicing strings. For example, you can check for year using Series.dt.year:
df['one'].dt.year == 2013
Combining this with loc allows you to get all rows where the year is 2013:
df.loc[df['one'].dt.year == 2013, :]
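Put together on the question's sample frame (a sketch; only two of the columns are reproduced, and the dates start out as strings just as in the question):

```python
import pandas as pd

data = pd.DataFrame(
    {"one": ["2013-01-01", "2014-01-05", "2015-05-06", "2016-10-11"],
     "two": [1, 5, 9, 13]},
    index=["Ohio", "Colorado", "Utah", "New York"],
)

data["one"] = pd.to_datetime(data["one"])   # strings -> datetime64
rows_2013 = data.loc[data["one"].dt.year == 2013, :]
```

Only the Ohio row survives the filter, since it is the only 2013 date.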
The condition you are looking for is
df['one'].str[0:4] == "2013"
Basically, you need to tell pandas to operate on the strings in that column, which is what the .str accessor does.
The way you have it written, data['one'][:] says "give me the column called 'one', then give me all of its rows" ([:]); the following [0:4] then slices the first four rows of the Series, not the first four characters of each string.
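For comparison, a sketch of the string route on the same sample data:

```python
import pandas as pd

data = pd.DataFrame(
    {"one": ["2013-01-01", "2014-01-05", "2015-05-06", "2016-10-11"]},
    index=["Ohio", "Colorado", "Utah", "New York"],
)

# .str applies the slice to each string element of the Series;
# data["one"][0:4] would instead slice the first four ROWS
mask = data["one"].str[0:4] == "2013"
rows_2013 = data[mask]
```

This works, but it silently depends on the dates always being formatted year-first, which is why converting to datetime is the more robust choice.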
query works out well too on datetime columns
In [13]: df.query('one == 2013')
Out[13]:
one two three four
Ohio 2013-01-01 1 2 3
Suppose I have a DataFrame with a monthly timestep as the index. I know I can use dataframe.groupby(lambda x: x.year) to group the monthly data into yearly data and apply other operations. Is there some way I could quickly group them by, say, decade?
Thanks for any hints.
To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
A
2001-01-31 2
2001-02-28 7
2001-03-31 12
2001-04-30 17
2001-05-31 22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
A
2001 354
2002 1074
2003 1794
2004 2514
2005 3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
A
2000 29106
2010 100740
2020 172740
2030 244740
2040 77424
If you don't have something on which you can use .year, you could still do lambda x: (x.year//10)*10.
If your DataFrame has headers, say DataFrame['Population', 'Salary', 'Vehicle count'],
make Year your index: DataFrame = DataFrame.set_index('Year')
(resample needs a datetime-like index, so convert Year with pd.to_datetime first if it holds plain integers).
Use the code below to resample the data into decades of 10 years; it also gives you the sum of all other columns within each decade:
dataframe = dataframe.resample('10AS').sum()
Use the year attribute of index:
df.groupby(df.index.year)
Let's say your date column goes by the name Date; then you can group up with
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: the iloc here chooses the first column in your DataFrame (the deprecated .ix and the how= argument have been replaced by .iloc and the .count() method call).
You can find the various offset aliases here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases