pandas create dataframe from two files - python
I have two txt files. The structure of the first one is:
:Data_list: 20160203_Gs_xr_1m.txt
:Created: 2016 Feb 04 0010 UTC
# Prepared by the U.S. Dept. of Commerce, NOAA, Space Weather Prediction Center
# Please send comments and suggestions to SWPC.Webmaster@noaa.gov
#
# Label: Short = 0.05- 0.4 nanometer
# Label: Long = 0.1 - 0.8 nanometer
# Units: Short = Watts per meter squared
# Units: Long = Watts per meter squared
# Source: GOES-13
# Location: W075
# Missing data: -1.00e+05
#
# 1-minute GOES-13 Solar X-ray Flux
#
#                  Modified  Seconds
#  UTC Date  Time  Julian    of the
#  YR MO DA  HHMM  Day       Day      Short     Long
#-------------------------------------------------------
2016 02 03 0000 57421 0 2.13e-09 4.60e-07
2016 02 03 0001 57421 60 1.84e-09 4.51e-07
2016 02 03 0002 57421 120 1.79e-09 4.52e-07
2016 02 03 0003 57421 180 1.58e-09 4.58e-07
2016 02 03 0004 57421 240 2.51e-09 4.56e-07
2016 02 03 0005 57421 300 4.30e-09 4.48e-07
2016 02 03 0006 57421 360 1.97e-09 4.47e-07
2016 02 03 0007 57421 420 2.46e-09 4.47e-07
2016 02 03 0008 57421 480 3.10e-09 4.51e-07
2016 02 03 0009 57421 540 3.24e-09 4.43e-07
2016 02 03 0010 57421 600 2.92e-09 4.34e-07
2016 02 03 0011 57421 660 2.42e-09 4.35e-07
2016 02 03 0012 57421 720 2.90e-09 4.40e-07
2016 02 03 0013 57421 780 1.87e-09 4.36e-07
2016 02 03 0014 57421 840 1.31e-09 4.37e-07
2016 02 03 0015 57421 900 2.50e-09 4.41e-07
2016 02 03 0016 57421 960 1.52e-09 4.42e-07
2016 02 03 0017 57421 1020 1.36e-09 4.41e-07
2016 02 03 0018 57421 1080 1.33e-09 4.35e-07
2016 02 03 0019 57421 1140 2.20e-09 4.37e-07
2016 02 03 0020 57421 1200 2.90e-09 4.53e-07
2016 02 03 0021 57421 1260 1.39e-09 4.75e-07
2016 02 03 0022 57421 1320 4.55e-09 4.67e-07
2016 02 03 0023 57421 1380 2.30e-09 4.58e-07
2016 02 03 0024 57421 1440 3.99e-09 4.53e-07
2016 02 03 0025 57421 1500 3.93e-09 4.40e-07
2016 02 03 0026 57421 1560 1.70e-09 4.34e-07
.
.
.
2016 02 03 2344 57421 85440 3.77e-09 5.00e-07
2016 02 03 2345 57421 85500 3.76e-09 4.96e-07
2016 02 03 2346 57421 85560 1.64e-09 4.97e-07
2016 02 03 2347 57421 85620 3.59e-09 5.04e-07
2016 02 03 2348 57421 85680 2.55e-09 5.04e-07
2016 02 03 2349 57421 85740 2.30e-09 5.11e-07
2016 02 03 2350 57421 85800 2.95e-09 5.09e-07
2016 02 03 2351 57421 85860 4.25e-09 5.02e-07
2016 02 03 2352 57421 85920 4.78e-09 5.02e-07
2016 02 03 2353 57421 85980 3.04e-09 5.01e-07
2016 02 03 2354 57421 86040 3.30e-09 5.10e-07
2016 02 03 2355 57421 86100 2.22e-09 5.16e-07
2016 02 03 2356 57421 86160 4.12e-09 5.15e-07
2016 02 03 2357 57421 86220 4.25e-09 5.16e-07
2016 02 03 2358 57421 86280 3.48e-09 5.20e-07
2016 02 03 2359 57421 86340 4.19e-09 5.27e-07
And the second one:
:Data_list: 20160204_Gs_xr_1m.txt
:Created: 2016 Feb 04 1301 UTC
# Prepared by the U.S. Dept. of Commerce, NOAA, Space Weather Prediction Center
# Please send comments and suggestions to SWPC.Webmaster@noaa.gov
#
# Label: Short = 0.05- 0.4 nanometer
# Label: Long = 0.1 - 0.8 nanometer
# Units: Short = Watts per meter squared
# Units: Long = Watts per meter squared
# Source: GOES-13
# Location: W075
# Missing data: -1.00e+05
#
# 1-minute GOES-13 Solar X-ray Flux
#
#                  Modified  Seconds
#  UTC Date  Time  Julian    of the
#  YR MO DA  HHMM  Day       Day      Short     Long
#-------------------------------------------------------
2016 02 04 0000 57422 0 4.85e-09 5.28e-07
2016 02 04 0001 57422 60 3.07e-09 5.29e-07
2016 02 04 0002 57422 120 4.48e-09 5.26e-07
2016 02 04 0003 57422 180 3.21e-09 5.17e-07
2016 02 04 0004 57422 240 4.23e-09 5.18e-07
2016 02 04 0005 57422 300 4.55e-09 5.21e-07
2016 02 04 0006 57422 360 3.30e-09 5.31e-07
2016 02 04 0007 57422 420 5.29e-09 5.49e-07
2016 02 04 0008 57422 480 3.14e-09 5.65e-07
2016 02 04 0009 57422 540 6.59e-09 5.70e-07
2016 02 04 0010 57422 600 6.04e-09 5.62e-07
2016 02 04 0011 57422 660 5.31e-09 5.62e-07
2016 02 04 0012 57422 720 6.04e-09 5.46e-07
2016 02 04 0013 57422 780 6.81e-09 5.51e-07
2016 02 04 0014 57422 840 6.59e-09 5.65e-07
2016 02 04 0015 57422 900 5.81e-09 5.62e-07
2016 02 04 0016 57422 960 4.63e-09 5.59e-07
2016 02 04 0017 57422 1020 3.05e-09 5.51e-07
2016 02 04 0018 57422 1080 3.26e-09 5.46e-07
2016 02 04 0019 57422 1140 4.59e-09 5.50e-07
2016 02 04 0020 57422 1200 3.38e-09 5.39e-07
2016 02 04 0021 57422 1260 2.43e-09 5.37e-07
2016 02 04 0022 57422 1320 5.31e-09 5.60e-07
2016 02 04 0023 57422 1380 5.63e-09 5.51e-07
2016 02 04 0024 57422 1440 5.18e-09 5.50e-07
2016 02 04 0025 57422 1500 7.06e-09 5.59e-07
2016 02 04 0026 57422 1560 5.01e-09 5.76e-07
2016 02 04 0027 57422 1620 7.17e-09 5.63e-07
2016 02 04 0028 57422 1680 5.74e-09 5.58e-07
2016 02 04 0029 57422 1740 5.55e-09 5.62e-07
2016 02 04 0030 57422 1800 4.99e-09 5.47e-07
2016 02 04 0031 57422 1860 5.49e-09 5.42e-07
2016 02 04 0032 57422 1920 2.14e-09 5.32e-07
2016 02 04 0033 57422 1980 2.48e-09 5.21e-07
2016 02 04 0034 57422 2040 4.35e-09 5.18e-07
2016 02 04 0035 57422 2100 4.84e-09 5.13e-07
2016 02 04 0036 57422 2160 3.12e-09 5.05e-07
2016 02 04 0037 57422 2220 1.18e-09 4.99e-07
2016 02 04 0038 57422 2280 1.59e-09 4.95e-07
Now I need to create a Pandas DataFrame with three named columns: the first is the time (YYYY MM DD HHMM), the second is xra (the penultimate column), and the third is xrb (the last column). Then I need to find the maximum of xrb together with its time. I think I know how to find the max and its index with pandas, but I don't know how to create the DataFrame: the 'header' runs to the 19th line, so I need to build the DataFrame from the two files without the header, using only the data. Also, is there a method to read data from one time to another (a time range)?
Thanks for the help.
Edit:
I have this script:
import urllib2
import datetime
import pandas as pd

# 'date' is a YYYYMMDD string; e.g. use today's date:
date = datetime.datetime.utcnow().strftime('%Y%m%d')

# download the 1-minute X-ray flux list and save it locally
xray_flux = urllib2.urlopen('ftp://ftp.swpc.noaa.gov/pub/lists/xray/' + date + '_Gp_xr_1m.txt')
flux = xray_flux.read()

dataflux = open('xray_flux.txt', 'w')
dataflux.write(flux)
dataflux.close()

# skip the 19 header lines; the data itself has no header row
a = pd.read_csv("xray_flux.txt", header=None, sep=" ", error_bad_lines=False, skiprows=19)
print a

df = pd.DataFrame(a)  # note: read_csv already returns a DataFrame
print df
You can try read_csv and concat:
import io
import pandas as pd

dateparse = lambda x: pd.datetime.strptime(x, '%Y %m %d %H%M')

# temp1 holds the contents of the first file as a string;
# after testing, replace io.StringIO(temp1) with the filename or URL, e.g.:
# df1 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt', ...)
df1 = pd.read_csv(io.StringIO(temp1),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse)
print df1.head()
             datetime      4     5             6             7
0 2016-02-03 00:00:00  57421     0  2.130000e-09  4.600000e-07
1 2016-02-03 00:01:00  57421    60  1.840000e-09  4.510000e-07
2 2016-02-03 00:02:00  57421   120  1.790000e-09  4.520000e-07
3 2016-02-03 00:03:00  57421   180  1.580000e-09  4.580000e-07
4 2016-02-03 00:04:00  57421   240  2.510000e-09  4.560000e-07
# temp holds the contents of the second file as a string;
# after testing, replace io.StringIO(temp) with the filename or URL, e.g.:
# df2 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt', ...)
df2 = pd.read_csv(io.StringIO(temp),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse)
print df2.head()
             datetime      4     5             6             7
0 2016-02-04 00:00:00  57422     0  4.850000e-09  5.280000e-07
1 2016-02-04 00:01:00  57422    60  3.070000e-09  5.290000e-07
2 2016-02-04 00:02:00  57422   120  4.480000e-09  5.260000e-07
3 2016-02-04 00:03:00  57422   180  3.210000e-09  5.170000e-07
4 2016-02-04 00:04:00  57422   240  4.230000e-09  5.180000e-07
df = pd.concat([df1[['datetime',6,7]],df2[['datetime',6,7]]])
df.columns = ['datetime','xra','xrb']
print df.head(10)
             datetime           xra           xrb
0 2016-02-03 00:00:00  2.130000e-09  4.600000e-07
1 2016-02-03 00:01:00  1.840000e-09  4.510000e-07
2 2016-02-03 00:02:00  1.790000e-09  4.520000e-07
3 2016-02-03 00:03:00  1.580000e-09  4.580000e-07
4 2016-02-03 00:04:00  2.510000e-09  4.560000e-07
5 2016-02-03 00:05:00  4.300000e-09  4.480000e-07
6 2016-02-03 00:06:00  1.970000e-09  4.470000e-07
7 2016-02-03 00:07:00  2.460000e-09  4.470000e-07
8 2016-02-03 00:08:00  3.100000e-09  4.510000e-07
9 2016-02-03 00:09:00  3.240000e-09  4.430000e-07
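To get the maximum of xrb together with its time, you can use idxmax. A minimal sketch using the df above; the index is reset first because concat leaves duplicate labels:

df = df.reset_index(drop=True)

# row with the largest xrb value - prints its datetime, xra and xrb
print df.loc[df['xrb'].idxmax()]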
EDIT:
You can also use the usecols parameter of read_csv to filter columns, since you only need the datetime column and columns 6 and 7. Then you can pass df1 and df2 whole to pd.concat:
# df1 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt', ...)
df1 = pd.read_csv(io.StringIO(temp1),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse,
                  usecols=[0, 1, 2, 3, 6, 7])
print df1.head()
             datetime             6             7
0 2016-02-03 00:00:00  2.130000e-09  4.600000e-07
1 2016-02-03 00:01:00  1.840000e-09  4.510000e-07
2 2016-02-03 00:02:00  1.790000e-09  4.520000e-07
3 2016-02-03 00:03:00  1.580000e-09  4.580000e-07
# df2 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt', ...)
df2 = pd.read_csv(io.StringIO(temp),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse,
                  usecols=[0, 1, 2, 3, 6, 7])
print df2.head()
             datetime             6             7
0 2016-02-04 00:00:00  4.850000e-09  5.280000e-07
1 2016-02-04 00:01:00  3.070000e-09  5.290000e-07
2 2016-02-04 00:02:00  4.480000e-09  5.260000e-07
3 2016-02-04 00:03:00  3.210000e-09  5.170000e-07
4 2016-02-04 00:04:00  4.230000e-09  5.180000e-07
df = pd.concat([df1,df2])
df.columns = ['datetime','xra','xrb']
print df
             datetime           xra           xrb
0 2016-02-03 00:00:00  2.130000e-09  4.600000e-07
1 2016-02-03 00:01:00  1.840000e-09  4.510000e-07
2 2016-02-03 00:02:00  1.790000e-09  4.520000e-07
3 2016-02-03 00:03:00  1.580000e-09  4.580000e-07
4 2016-02-03 00:04:00  2.510000e-09  4.560000e-07
5 2016-02-03 00:05:00  4.300000e-09  4.480000e-07
6 2016-02-03 00:06:00  1.970000e-09  4.470000e-07
7 2016-02-03 00:07:00  2.460000e-09  4.470000e-07
8 2016-02-03 00:08:00  3.100000e-09  4.510000e-07
9 2016-02-03 00:09:00  3.240000e-09  4.430000e-07
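And for reading data from some time to some time, one option is to set datetime as the index and slice it with date strings. A sketch using the df above; the two timestamps are just example values:

# make datetime the index and sort it, then slice between two timestamps
df = df.set_index('datetime').sort_index()
print df.loc['2016-02-03 12:00':'2016-02-04 00:30']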