I have two types of text files which I need to read into a pandas DataFrame, and I have a problem with the datetimes and the separator.
File A:
2009,7,1,3,101,13.03,89.33,0.6,287.69,0
2009,7,1,6,102,19.3,55,1,288.67,0
2009,7,1,9,103,22.33,39.67,1,289.6,0
2009,7,1,12,104,21.97,41,1,295.68,0
File B:
2019 9 1 3.00 101 14.02 92.08 2.62 174.77 0.109
2019 9 1 6.00 102 13.79 92.86 2.79 179.29 0.046
2019 9 1 9.00 103 13.81 92.60 2.73 178.94 0.070
2019 9 1 12.00 104 13.31 95.20 2.91 179.38 0.015
fileA.txt has no extra spaces; fileB.txt has one extra space at the beginning of each line. I can read each of these as follows, and the results are correct:
>>> import pandas as pd
>>> from datetime import datetime as dtdt
>>> par3 = lambda x: dtdt.strptime(x, '%Y %m %d %H')
>>> par4 = lambda x: dtdt.strptime(x, '%Y %m %d %H.%M')
>>> df3=pd.read_csv('fileA.txt',header=None,parse_dates={'Date': [0,1,2,3]}, date_parser=par3, index_col='Date')
>>> df3
4 5 6 7 8 9
Date
2009-07-01 03:00:00 101 13.03 89.33 0.6 287.69 0
2009-07-01 06:00:00 102 19.30 55.00 1.0 288.67 0
2009-07-01 09:00:00 103 22.33 39.67 1.0 289.60 0
2009-07-01 12:00:00 104 21.97 41.00 1.0 295.68 0
>>> dg3=pd.read_csv('fileB.txt',sep='\s+',engine='python',header=None,parse_dates={'Date': [0,1,2,3]}, date_parser=par4, index_col='Date')
>>> dg3
4 5 6 7 8 9
Date
2019-09-01 03:00:00 101 14.02 92.08 2.62 174.77 0.109
2019-09-01 06:00:00 102 13.79 92.86 2.79 179.29 0.046
2019-09-01 09:00:00 103 13.81 92.60 2.73 178.94 0.070
2019-09-01 12:00:00 104 13.31 95.20 2.91 179.38 0.015
Question: how do I read in both of these types with the same command? The only way I can think of is to first open the file and read the first line to deduce the hour format (column 3) and the separator, but that feels like a non-Pythonic way.
Also, if the hour-reading float is e.g. 3.75, it would be OK to round it to the nearest integer and just set the minutes to 0.
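A rough sketch of what I have in mind (read_any and parse_row_date are just placeholder names I made up): normalise the separators first, then let a single parser handle both the '3' and the '3.00' hour styles, rounding fractional hours as described above:

import pandas as pd
from datetime import datetime as dtdt
from io import StringIO

def parse_row_date(s):
    # s is the four date columns joined by spaces, e.g. '2009 7 1 3' or '2019 9 1 3.00'
    y, m, d, h = s.split()
    # round a fractional hour (e.g. 3.75) to the nearest whole hour, minutes stay 0
    return dtdt(int(y), int(m), int(d)) + pd.Timedelta(hours=round(float(h)))

def read_any(path):
    # normalise both layouts: strip leading spaces, turn commas into whitespace
    with open(path) as fh:
        text = ''.join(line.lstrip().replace(',', ' ') for line in fh)
    return pd.read_csv(StringIO(text), sep=r'\s+', header=None,
                       parse_dates={'Date': [0, 1, 2, 3]},
                       date_parser=parse_row_date, index_col='Date')

df3 = read_any('fileA.txt')
dg3 = read_any('fileB.txt')

Is something like this reasonable, or is there a cleaner built-in way?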
I have the following dataframe
Timestamp KE EH LA PR
0 2013-02-27 00:00:00 1.000000 2.0 0.03 201.289993
1 2013-02-27 00:15:00 0.990000 2.0 0.03 210.070007
2 2013-02-27 00:30:00 0.950000 2.0 0.02 207.779999
3 2013-02-27 00:45:00 0.990000 2.0 0.03 151.960007
4 2013-02-27 01:00:00 341.209991 2.0 0.04 0.000000
... ... ... ... ... ...
3449 2013-04-03 22:15:00 NaN 2.0 0.03 0.000000
3450 2013-04-03 22:30:00 NaN NaN 0.07 0.000000
I would like to turn the timestamp column into a numeric value while keeping only the 'time' part, as illustrated below:
Timestamp KE EH LA PR
0 0 1.000000 2.0 0.03 201.289993
1 15 0.990000 2.0 0.03 210.070007
2 30 0.950000 2.0 0.02 207.779999
3 45 0.990000 2.0 0.03 151.960007
4 100 341.209991 2.0 0.04 0.000000
... ... ... ... ... ...
3449 2215 NaN 2.0 0.03 0.000000
3450 2230 NaN NaN 0.07 0.000000
I tried the code below, based on this post [1]:
h3_hours["Timestamp"] = h3_hours["Timestamp"].astype(str).astype(int)
But this gives me the error below
ValueError: invalid literal for int() with base 10: '2013-02-27 00:00:00'
Then I found this approach online:
# Turn datetime into hours and minutes
h3_hours["Timestamp"] = h3_hours["Timestamp"].dt.strftime("%H:%M")
h3_hours["Timestamp"] = h3_hours["Timestamp"].astype(int)
But this gives me the same error as above.
Any idea on how I can turn my Timestamp column into integer hours?
[1]: Pandas: convert dtype 'object' to int
You are close; just remove the : so the format is %H%M (HHMM):
h3_hours["Timestamp"] = h3_hours["Timestamp"].dt.strftime("%H%M").astype(int)
I have a dataframe that contains cell phone minute usage, logged by date of call and duration.
It looks like this (30 row sample):
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
I want to group by user_id AND call_date with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.
I thought I could accomplish this by using:
calls.groupby(['user_id','call_date'])['duration'].sum()
but the results aren't what I expected:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-08-14 13.86
2018-08-16 23.46
2018-08-17 8.11
2018-08-18 1.74
2018-08-19 10.73
2018-08-20 7.32
2018-08-21 0.00
2018-08-23 8.50
2018-08-24 8.63
2018-08-25 35.39
2018-08-27 10.57
2018-08-28 19.91
2018-08-29 0.54
2018-08-31 22.38
2018-09-01 7.53
2018-09-02 10.27
2018-09-03 30.66
2018-09-04 0.00
2018-09-05 9.09
2018-09-06 10.06
I'd hoped that it would be grouped like: user_id 1000, all calls for January with duration summed, all calls for February with duration summed, etc.
I am really new to Python and programming in general, and I'm not sure what my next step should be to get these grouped by user_id and month of the year.
Thanks in advance for any insight you can offer.
Regards,
Jared
Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it up like so; note that we explicitly convert call_date to datetime:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
"""), delim_whitespace = True, index_col=0)
df['call_date'] = pd.to_datetime(df['call_date'])
Then using
df.groupby(['user_id','call_date'])['duration'].sum()
does the expected grouping by user and by each date:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-09-06 10.06
2018-09-21 5.75
2018-09-30 14.78
2018-10-12 1.00
2018-10-17 15.83
2018-10-27 0.98
2018-10-28 5.90
2018-11-09 1.00
2018-11-15 30.00
2018-11-17 2.45
2018-11-19 2.40
2018-12-04 7.19
2018-12-05 0.00
2018-12-13 6.27
2018-12-24 0.00
If you want to group by month, as you seem to suggest, you can use the Grouper functionality:
df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()
which produces
user_id call_date
1000 2018-12-31 116.83
1001 2018-09-30 30.59
2018-10-31 23.71
2018-11-30 35.85
2018-12-31 13.46
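An equivalent, if you prefer month labels like 2018-12 over month-end dates, is to group on a monthly period derived from call_date (a small variation on the above, not something you need for it to work):

# same grouping, but label each group with a monthly Period (e.g. 2018-12)
monthly = df.groupby(['user_id', df['call_date'].dt.to_period('M')])['duration'].sum()
print(monthly)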
Let me know if you are getting different results from following these steps.
I have a dataframe which has different rates for multiple ('N') currencies over a time period.
dataframe
Dates AUD CAD CHF GBP EUR
20/05/2019 0.11 -0.25 -0.98 0.63 0.96
21/05/2019 0.14 -0.35 -0.92 1.92 0.92
...
02/01/2020 0.135 -0.99 -1.4 0.93 0.83
Firstly, I would like to reshape the dataframe to look like the below, as I would like to join another table which would be in a similar format:
dataframe
Dates Pairs Rates
20/05/2019 AUD 0.11
20/05/2019 CAD -0.25
20/05/2019 CHF -0.98
...
...
02/01/2020 AUD 0.135
02/01/2020 CAD -0.99
02/01/2020 CHF -1.4
Then, for every currency, I would like to plot a histogram. So with the above, it would be 5 separate histograms, one per currency.
I assume I would need to do this in some sort of loop, but I'm not sure of the easiest way to approach it.
Thanks
Use DataFrame.melt first:
df['Dates'] = pd.to_datetime(df['Dates'], dayfirst=True)
df = df.melt('Dates', var_name='Pairs', value_name='Rates')
print (df)
Dates Pairs Rates
0 2019-05-20 AUD 0.110
1 2019-05-21 AUD 0.140
2 2020-01-02 AUD 0.135
3 2019-05-20 CAD -0.250
4 2019-05-21 CAD -0.350
5 2020-01-02 CAD -0.990
6 2019-05-20 CHF -0.980
7 2019-05-21 CHF -0.920
8 2020-01-02 CHF -1.400
9 2019-05-20 GBP 0.630
10 2019-05-21 GBP 1.920
11 2020-01-02 GBP 0.930
12 2019-05-20 EUR 0.960
13 2019-05-21 EUR 0.920
14 2020-01-02 EUR 0.830
And then DataFrameGroupBy.hist:
df.groupby('Pairs').hist()
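If you would rather have explicit control over each figure (titles, bins, saving to file), a plain loop over the groups works too; a minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

for pair, grp in df.groupby('Pairs'):
    fig, ax = plt.subplots()
    ax.hist(grp['Rates'].dropna())   # one histogram per currency
    ax.set_title(pair)
plt.show()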
I have some data with multiple observations for a given Collector, Date, Sample, and Type where the observation values vary by ID.
from io import StringIO
import pandas as pd
data = """Collector,Date,Sample,Type,ID,Value
Emily,2014-06-20,201,HV,A,34
Emily,2014-06-20,201,HV,B,22
Emily,2014-06-20,201,HV,C,10
Emily,2014-06-20,201,HV,D,5
John,2014-06-22,221,HV,A,40
John,2014-06-22,221,HV,B,39
John,2014-06-22,221,HV,C,11
John,2014-06-22,221,HV,D,2
Emily,2014-06-23,203,HV,A,33
Emily,2014-06-23,203,HV,B,35
Emily,2014-06-23,203,HV,C,13
Emily,2014-06-23,203,HV,D,1
John,2014-07-01,218,HV,A,35
John,2014-07-01,218,HV,B,29
John,2014-07-01,218,HV,C,13
John,2014-07-01,218,HV,D,1
"""
>>> df = pd.read_csv(StringIO(data), parse_dates=["Date"])
After doing some graphing with the data in this long format, I pivot it to a wide summary table format with columns for each ID.
>>> table = df.pivot_table(index=["Collector", "Date", "Sample", "Type"], columns="ID", values="Value")
ID A B C D
Collector Date Sample Type
Emily 2014-06-20 201 HV 34 22 10 5
2014-06-23 203 HV 33 35 13 1
John 2014-06-22 221 HV 40 39 11 2
2014-07-01 218 HV 35 29 13 1
However, I can't find a concise way to calculate and add some summary rows to the wide format data with mean, median, and maybe some custom aggregation function applied to each of the ID-based columns. This is what I want to end up with:
ID Collector Date Sample Type A B C D
0 Emily 2014-06-20 201 HV 34 22 10 5
2 John 2014-06-22 221 HV 40 39 11 2
1 Emily 2014-06-23 203 HV 33 35 13 1
3 John 2014-07-01 218 HV 35 29 13 1
4 mean 35.5 31.3 11.8 2.3
5 median 34.5 32.0 12.0 1.5
I tried things like calling mean or median on the summary table, but I end up with a Series rather than a row I can concatenate to the summary table. The summary rows I want are sort of like pivot_table margins, but the aggregation function is not sum.
>>> table.mean()
ID
A 35.50
B 31.25
C 11.75
D 2.25
dtype: float64
>>> table.median()
ID
A 34.5
B 32.0
C 12.0
D 1.5
dtype: float64
You could use aggfunc=[np.mean, np.median] to compute both the means and the medians. Then you could use margins=True to also obtain the means and medians for each column and for each row.
import numpy as np

result = df.pivot_table(index=["Collector", "Date", "Sample", "Type"],
                        columns="ID", values="Value", margins=True,
                        aggfunc=[np.mean, np.median]).stack(level=0)
yields
ID A B C D All
Collector Date Sample Type
Emily 2014-06-20 201 HV mean 34.0 22.00 10.00 5.00 17.7500
median 34.0 22.00 10.00 5.00 16.0000
2014-06-23 203 HV mean 33.0 35.00 13.00 1.00 20.5000
median 33.0 35.00 13.00 1.00 23.0000
John 2014-06-22 221 HV mean 40.0 39.00 11.00 2.00 23.0000
median 40.0 39.00 11.00 2.00 25.0000
2014-07-01 218 HV mean 35.0 29.00 13.00 1.00 19.5000
median 35.0 29.00 13.00 1.00 21.0000
All mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Yes, result contains more data than you asked for, but
result.loc['All']
has the additional values:
ID A B C D All
Date Sample Type
mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Or, you could further subselect result to get just the rows you are looking for:
result.index.names = ['Collector', 'Date', 'Sample', 'Type', 'aggfunc']
mask = result.index.get_level_values('aggfunc') == 'mean'
mask[-1] = True   # also keep the final row (the 'All' median)
result = result.loc[mask]
print(result)
yields
ID A B C D All
Collector Date Sample Type aggfunc
Emily 2014-06-20 201 HV mean 34.0 22.00 10.00 5.00 17.7500
2014-06-23 203 HV mean 33.0 35.00 13.00 1.00 20.5000
John 2014-06-22 221 HV mean 40.0 39.00 11.00 2.00 23.0000
2014-07-01 218 HV mean 35.0 29.00 13.00 1.00 19.5000
All mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
This might not be super clean, but you could assign to the new entries with .loc.
In [131]: table_mean = table.mean()
In [132]: table_median = table.median()
In [134]: table.loc['Mean', :] = table_mean.values
In [135]: table.loc['Median', :] = table_median.values
In [136]: table
Out[136]:
ID A B C D
Collector Date Sample Type
Emily 2014-06-20 201 HV 34.0 22.00 10.00 5.00
2014-06-23 203 HV 33.0 35.00 13.00 1.00
John 2014-06-22 221 HV 40.0 39.00 11.00 2.00
2014-07-01 218 HV 35.0 29.00 13.00 1.00
Mean 35.5 31.25 11.75 2.25
Median 34.5 32.00 12.00 1.50
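If you would rather keep a flat table like the one in your question (instead of growing the MultiIndex), another option is to build the summary rows separately and concatenate them. A sketch, starting from the original table before any Mean/Median rows are added; the Date/Sample/Type cells of the summary rows come out as NaN:

import pandas as pd

# one row per statistic, with the same A/B/C/D columns as table
summary = pd.DataFrame({'mean': table.mean(), 'median': table.median()}).T
summary = summary.rename_axis('Collector').reset_index()

out = pd.concat([table.reset_index(), summary], ignore_index=True)
print(out)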