Manipulating data from csv using pandas

Here is a question about handling data with pandas. I want to fetch two columns from a CSV file and manipulate the data before finally saving it.
The CSV file looks like this:
year month
2007 1
2007 2
2007 3
2007 4
2008 1
2008 3
This is my current code:
records = pd.read_csv(path)
frame = pd.DataFrame(records)
combined = datetime(frame['year'].astype(int), frame['month'].astype(int), 1)
The error is:
TypeError: cannot convert the series to "<type 'int'>"
Any thoughts?

datetime won't operate on a pandas Series (a column of a DataFrame); it expects plain integers. You can use to_datetime, or you could call datetime within apply. Something like the following should work:
In [9]: df
Out[9]:
year month
0 2007 1
1 2007 2
2 2007 3
3 2007 4
4 2008 1
5 2008 3
In [10]: pd.to_datetime(df['year'].astype(str) + '-'
+ df['month'].astype(str)
+ '-1')
Out[10]:
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
dtype: datetime64[ns]
Or use apply:
In [11]: df.apply(lambda x: datetime(x['year'],x['month'],1),axis=1)
Out[11]:
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
dtype: datetime64[ns]
Another edit: you can also do most of the date parsing with read_csv, but then you need to adjust the day after you read the data in (note that my data is in a string named 'data'):
In [12]: df = pd.read_csv(StringIO(data),header=0,
parse_dates={'date':['year','month']})
In [13]: df['date'] = df['date'].values.astype('datetime64[M]')
In [14]: df
Out[14]:
date
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
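
The question also mentions saving the result, which the session above does not show. A minimal end-to-end sketch follows, assuming a comma-separated file; the inline string `raw` and the in-memory buffer `out` are hypothetical stand-ins for real file paths.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the CSV on disk (the comma separator is an assumption).
raw = "year,month\n2007,1\n2007,2\n2008,3\n"

# read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrapper is needed.
frame = pd.read_csv(StringIO(raw))

# Build "YYYY-M-1" strings column-wise and parse them in one call.
frame["date"] = pd.to_datetime(
    frame["year"].astype(str) + "-" + frame["month"].astype(str) + "-1",
    format="%Y-%m-%d",
)

# Save the result; pass a file path instead of the buffer to write to disk.
out = StringIO()
frame.to_csv(out, index=False)
saved = out.getvalue()
```

Passing a path such as a .csv filename to read_csv and to_csv in place of the two StringIO objects would do the same thing against real files.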

I had a similar issue. This answer assumes that you have the Year, Month and Day in columns of your DataFrame:
df['Date'] = df[['Year', 'Month', 'Day']].apply(lambda s : datetime.datetime(*s),axis = 1)
The first part selects the Year, Month and Day columns from the DataFrame; the second part applies the datetime constructor row by row.
If you do not have the day in your data, as appears to be the case here, just do:
df['Day'] = 1
to add the day as well. There should be a cleaner way to do this in code, but this is a quick workaround. You can always drop the Day column afterward if you don't want it.
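
Put together, the workaround might look like the sketch below. The sample values are made up for illustration, and the column names Year, Month and Day follow this answer rather than the lowercase names in the question.

```python
import datetime

import pandas as pd

# Illustrative data only; the real frame would come from the CSV file.
df = pd.DataFrame({"Year": [2007, 2007, 2008], "Month": [1, 2, 3]})

# Add a constant day so each row has the three values datetime needs.
df["Day"] = 1

# axis=1 passes one row at a time; *s unpacks it into (year, month, day).
df["Date"] = df[["Year", "Month", "Day"]].apply(
    lambda s: datetime.datetime(*s), axis=1
)

# Drop the helper column once the dates are built.
df = df.drop(columns="Day")
```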

Related

How to convert columns in a dataframe into time series?

So I selected 3 columns from my dataframe in order to create a time series that I could then plot:
booking_date = pd.DataFrame({'day': hotel_bookings_cleaned["arrival_date_day_of_month"],
'month': hotel_bookings_cleaned["arrival_date_month"],
'year': hotel_bookings_cleaned["arrival_date_year"]})
and the output looks like:
day month year
0 1 July 2015
1 1 July 2015
2 1 July 2015
3 1 July 2015
4 1 July 2015
I tried using
dates = pd.to_datetime(booking_date)
but got the error message
ValueError: Unable to parse string "July" at position 0
I'm assuming I need to convert the Month column to a numeric value before I can convert it to a datetime, but I haven't been able to make any parsers work.

Try this:
dates = pd.to_datetime(booking_date.astype(str).agg('-'.join, axis=1), format='%d-%B-%Y')
Out[13]:
0 2015-07-01
1 2015-07-01
2 2015-07-01
3 2015-07-01
4 2015-07-01
dtype: datetime64[ns]

I'm not sure if this is more performant than the previous answer, but you can convert your string column to integers with a dictionary mapping, to fit the format that pandas expects in to_datetime():
month_map = {
'January':1,
'February':2,
'March':3,
'April':4,
'May':5,
'June':6,
'July':7,
'August':8,
'September':9,
'October':10,
'November':11,
'December':12
}
dates = pd.DataFrame({
'day':booking_date.day,
'month':booking_date.month.apply(lambda x: month_map[x]),
'year':booking_date.year
})
ts = pd.to_datetime(dates)
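
As a variation on the hand-written dictionary, the mapping could be built from the standard library's calendar.month_name. This is only a sketch: it assumes an English locale (calendar.month_name is locale-dependent), and the two-row booking_date frame is made-up sample data.

```python
import calendar

import pandas as pd

# calendar.month_name[0] is an empty string, so it is skipped here.
month_map = {name: number for number, name in enumerate(calendar.month_name) if name}

# Made-up sample in the same shape as the question's booking_date frame.
booking_date = pd.DataFrame(
    {"day": [1, 15], "month": ["July", "August"], "year": [2015, 2015]}
)

# Series.map looks each month name up in the dictionary.
dates = booking_date.assign(month=booking_date["month"].map(month_map))

# to_datetime assembles dates from columns named year, month and day.
ts = pd.to_datetime(dates)
```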

parse multiple date format pandas

I'm stuck with the following format:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in the format %Y-%m.
What I tried was:
def parse(date):
    if len(date)<=5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass
df['Date']= parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; values in a format like 2001-12-25 came out as None.
So, how can I do it? Thank you!

We can use pd.to_datetime with errors='coerce' to parse the dates in steps.
Assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps:
First, we parse the complete dates into a new Series called s:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, we have two NaT values (null datetimes) in our Series; these correspond to your entries that are missing a day.
We then call to_datetime again with the other format ('%Y%m') and use the result to fill the missing values of s:
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
Then we re-assign to your DataFrame.
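
Since the question asks for the %Y-%m form, the two passes can be wrapped in a small helper and the result formatted with dt.strftime. This is a sketch only: the helper name parse_mixed and the year_month column are hypothetical, and the sample values are the ones from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"date": ["2001-12-25", "2002-9-27", "2001-2-24", "2001-5-3", "200510", "20078"]}
)


def parse_mixed(col):
    """Parse full dates first, then fill the gaps with year+month values."""
    full = pd.to_datetime(col, format="%Y-%m-%d", errors="coerce")
    partial = pd.to_datetime(col, format="%Y%m", errors="coerce")
    return full.fillna(partial)


df["date_fixed"] = parse_mixed(df["date"])

# strftime turns the datetimes back into year-month strings.
df["year_month"] = df["date_fixed"].dt.strftime("%Y-%m")
```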

You could use a regex to pull out the year and month, and convert to datetime:
df = pd.read_clipboard(r"\s{2,}",header=None,names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically sets the day to 1, since only the year and month were supplied.
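
If only the %Y-%m text is needed, the extracted groups can be joined as strings without going through datetime at all. The sketch below uses a slightly simplified form of the same pattern; the YearMonth column name is hypothetical, and the frame is built inline instead of from the clipboard.

```python
import pandas as pd

df = pd.DataFrame(
    {"Dates": ["2001-12-25", "2002-9-27", "2001-2-24", "2001-5-3", "200510", "20078"]}
)

# Four digits for the year, an optional dash, then one or two digits for the month.
pattern = r"(?P<Year>\d{4})-?(?P<Month>\d{1,2})"

# str.extract returns a DataFrame with one column per named group.
parts = df["Dates"].str.extract(pattern)

# zfill pads single-digit months so every value has the same width.
df["YearMonth"] = parts["Year"] + "-" + parts["Month"].str.zfill(2)
```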

Converting date using to_datetime

I am still quite new to Python, so please excuse my basic question.
After a reset of pandas grouped dataframe, I get the following:
year month pl
0 2010 1 27.4376
1 2010 2 29.2314
2 2010 3 33.5714
3 2010 4 37.2986
4 2010 5 36.6971
5 2010 6 35.9329
I would like to merge year and month to one column in pandas datetime format.
I am trying:
C3['date']=pandas.to_datetime(C3.year + C3.month, format='%Y-%m')
But it gives me a date like this:
year month pl date
0 2010 1 27.4376 1970-01-01 00:00:00.000002011
What is the correct way? Thank you.

You need to convert to str if necessary, then zfill the month column and pass the result with a valid format to to_datetime:
In [303]:
df['date'] = pd.to_datetime(df['year'].astype(str) + df['month'].astype(str).str.zfill(2), format='%Y%m')
df
Out[303]:
year month pl date
0 2010 1 27.4376 2010-01-01
1 2010 2 29.2314 2010-02-01
2 2010 3 33.5714 2010-03-01
3 2010 4 37.2986 2010-04-01
4 2010 5 36.6971 2010-05-01
5 2010 6 35.9329 2010-06-01
If the conversion is unnecessary (that is, the columns already hold strings), then the following should work:
df['date'] = pd.to_datetime(df['year'] + df['month'].str.zfill(2), format='%Y%m')
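Your attempt failed because adding the integer columns yields a plain number (2010 + 1 = 2011), which was treated as nanoseconds since the epoch. An integer input shows the same behavior: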
In [305]:
pd.to_datetime(20101, format='%Y-%m')
Out[305]:
Timestamp('1970-01-01 00:00:00.000020101')
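
To make the difference concrete, the sketch below compares the integer sum from the question with an integer encoding (year * 100 + month) that keeps both parts intact. The encoding is an alternative to zfill swapped in for illustration, not something the answer above uses, and the sample rows are made up.

```python
import pandas as pd

df = pd.DataFrame({"year": [2010, 2010, 2010], "month": [1, 2, 12]})

# Adding integer columns is arithmetic, not concatenation: 2010 + 1 = 2011.
summed = df["year"] + df["month"]

# Multiplying the year by 100 leaves two digits for the month: 201001, 201002, 201012.
encoded = df["year"] * 100 + df["month"]

# The six-digit strings now match the '%Y%m' format.
df["date"] = pd.to_datetime(encoded.astype(str), format="%Y%m")
```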

Pandas Python- can datetime be used with vectorized inputs

My pandas DataFrame has year, month and day in the first 3 columns. To convert them into a datetime type, I use a for loop that goes over each row, taking the contents of the first 3 columns as inputs to the datetime function. Is there any way I can avoid the for loop here and get the dates as datetimes?

I'm not sure there's a vectorized hook, but you can use apply, anyhow:
>>> df = pd.DataFrame({"year": [1992, 2003, 2014], "month": [2,3,4], "day": [10,20,30]})
>>> df
day month year
0 10 2 1992
1 20 3 2003
2 30 4 2014
>>> from datetime import datetime  # pd.datetime is no longer available in recent pandas
>>> df["Date"] = df.apply(lambda x: datetime(x['year'], x['month'], x['day']), axis=1)
>>> df
day month year Date
0 10 2 1992 1992-02-10 00:00:00
1 20 3 2003 2003-03-20 00:00:00
2 30 4 2014 2014-04-30 00:00:00
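
For what it's worth, pd.to_datetime can assemble dates directly from a DataFrame whose columns are named year, month and day, which avoids both the loop and apply. A minimal sketch using the same sample frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [1992, 2003, 2014], "month": [2, 3, 4], "day": [10, 20, 30]}
)

# to_datetime recognizes the column names and builds one date per row.
df["Date"] = pd.to_datetime(df[["year", "month", "day"]])
```

If the columns carry other names, renaming them first (for example with DataFrame.rename) would be needed for this to work.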

Pandas pivot_table on date

I have a pandas DataFrame with a date column. It is not an index.
I want to make a pivot_table on the dataframe using counting aggregate per month for each location.
The data look like this:
['INDEX'] DATE LOCATION COUNT
0 2009-01-02 00:00:00 AAH 1
1 2009-01-03 00:00:00 ABH 1
2 2009-01-03 00:00:00 AAH 1
3 2009-01-03 00:00:00 ABH 1
4 2009-01-04 00:00:00 ACH 1
I used:
pivot_table(cdiff, values='COUNT', rows=['DATE','LOCATION'], aggfunc=np.sum)
to pivot the values. I need a way to convert cdiff.DATE to a month rather than a date.
I hope to end up with something like:
MONTH LOCATION COUNT
January AAH 2
January ABH 2
January ACH 1
I tried all manner of strftime methods on cdiff.DATE with no success. They want to be applied to strings, not to a Series object.

I would suggest:
months = cdiff.DATE.map(lambda x: x.month)
pivot_table(cdiff, values='COUNT', index=[months, 'LOCATION'],
aggfunc=np.sum)
(Older pandas versions spelled the index keyword as rows, as in the question.) To get a month name, pass a different function or use the built-in calendar.month_name. To get the data in the format you want, call reset_index on the result, or you could also do:
cdiff.groupby([months, 'LOCATION'], as_index=False).sum()
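
With current pandas, the month name is available through the .dt accessor, so the whole thing might be sketched as below. The frame is rebuilt from the question's sample rows, the MONTH column name is taken from the desired output, and an English locale is assumed for month_name().

```python
import pandas as pd

# Sample rows from the question, with DATE as real datetimes.
cdiff = pd.DataFrame(
    {
        "DATE": pd.to_datetime(
            ["2009-01-02", "2009-01-03", "2009-01-03", "2009-01-03", "2009-01-04"]
        ),
        "LOCATION": ["AAH", "ABH", "AAH", "ABH", "ACH"],
        "COUNT": [1, 1, 1, 1, 1],
    }
)

monthly = (
    cdiff.assign(MONTH=cdiff["DATE"].dt.month_name())  # e.g. "January"
    .pivot_table(values="COUNT", index=["MONTH", "LOCATION"], aggfunc="sum")
    .reset_index()  # turn the MONTH/LOCATION index back into columns
)
```

Note that grouping by month name alone merges the same month across different years; adding a year column to the index would keep them apart.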
