pandas create dataframe from two files - python
I have two txt files. The structure of the first one is:
:Data_list: 20160203_Gs_xr_1m.txt
:Created: 2016 Feb 04 0010 UTC
# Prepared by the U.S. Dept. of Commerce, NOAA, Space Weather Prediction Center
# Please send comments and suggestions to SWPC.Webmaster@noaa.gov
#
# Label: Short = 0.05- 0.4 nanometer
# Label: Long = 0.1 - 0.8 nanometer
# Units: Short = Watts per meter squared
# Units: Long = Watts per meter squared
# Source: GOES-13
# Location: W075
# Missing data: -1.00e+05
#
# 1-minute GOES-13 Solar X-ray Flux
#
#                  Modified  Seconds
#  UTC Date  Time  Julian    of the
#  YR MO DA  HHMM  Day       Day      Short     Long
#-------------------------------------------------------
2016 02 03 0000 57421 0 2.13e-09 4.60e-07
2016 02 03 0001 57421 60 1.84e-09 4.51e-07
2016 02 03 0002 57421 120 1.79e-09 4.52e-07
2016 02 03 0003 57421 180 1.58e-09 4.58e-07
2016 02 03 0004 57421 240 2.51e-09 4.56e-07
2016 02 03 0005 57421 300 4.30e-09 4.48e-07
2016 02 03 0006 57421 360 1.97e-09 4.47e-07
2016 02 03 0007 57421 420 2.46e-09 4.47e-07
2016 02 03 0008 57421 480 3.10e-09 4.51e-07
2016 02 03 0009 57421 540 3.24e-09 4.43e-07
2016 02 03 0010 57421 600 2.92e-09 4.34e-07
2016 02 03 0011 57421 660 2.42e-09 4.35e-07
2016 02 03 0012 57421 720 2.90e-09 4.40e-07
2016 02 03 0013 57421 780 1.87e-09 4.36e-07
2016 02 03 0014 57421 840 1.31e-09 4.37e-07
2016 02 03 0015 57421 900 2.50e-09 4.41e-07
2016 02 03 0016 57421 960 1.52e-09 4.42e-07
2016 02 03 0017 57421 1020 1.36e-09 4.41e-07
2016 02 03 0018 57421 1080 1.33e-09 4.35e-07
2016 02 03 0019 57421 1140 2.20e-09 4.37e-07
2016 02 03 0020 57421 1200 2.90e-09 4.53e-07
2016 02 03 0021 57421 1260 1.39e-09 4.75e-07
2016 02 03 0022 57421 1320 4.55e-09 4.67e-07
2016 02 03 0023 57421 1380 2.30e-09 4.58e-07
2016 02 03 0024 57421 1440 3.99e-09 4.53e-07
2016 02 03 0025 57421 1500 3.93e-09 4.40e-07
2016 02 03 0026 57421 1560 1.70e-09 4.34e-07
.
.
.
2016 02 03 2344 57421 85440 3.77e-09 5.00e-07
2016 02 03 2345 57421 85500 3.76e-09 4.96e-07
2016 02 03 2346 57421 85560 1.64e-09 4.97e-07
2016 02 03 2347 57421 85620 3.59e-09 5.04e-07
2016 02 03 2348 57421 85680 2.55e-09 5.04e-07
2016 02 03 2349 57421 85740 2.30e-09 5.11e-07
2016 02 03 2350 57421 85800 2.95e-09 5.09e-07
2016 02 03 2351 57421 85860 4.25e-09 5.02e-07
2016 02 03 2352 57421 85920 4.78e-09 5.02e-07
2016 02 03 2353 57421 85980 3.04e-09 5.01e-07
2016 02 03 2354 57421 86040 3.30e-09 5.10e-07
2016 02 03 2355 57421 86100 2.22e-09 5.16e-07
2016 02 03 2356 57421 86160 4.12e-09 5.15e-07
2016 02 03 2357 57421 86220 4.25e-09 5.16e-07
2016 02 03 2358 57421 86280 3.48e-09 5.20e-07
2016 02 03 2359 57421 86340 4.19e-09 5.27e-07
And the second one:
:Data_list: 20160204_Gs_xr_1m.txt
:Created: 2016 Feb 04 1301 UTC
# Prepared by the U.S. Dept. of Commerce, NOAA, Space Weather Prediction Center
# Please send comments and suggestions to SWPC.Webmaster@noaa.gov
#
# Label: Short = 0.05- 0.4 nanometer
# Label: Long = 0.1 - 0.8 nanometer
# Units: Short = Watts per meter squared
# Units: Long = Watts per meter squared
# Source: GOES-13
# Location: W075
# Missing data: -1.00e+05
#
# 1-minute GOES-13 Solar X-ray Flux
#
#                  Modified  Seconds
#  UTC Date  Time  Julian    of the
#  YR MO DA  HHMM  Day       Day      Short     Long
#-------------------------------------------------------
2016 02 04 0000 57422 0 4.85e-09 5.28e-07
2016 02 04 0001 57422 60 3.07e-09 5.29e-07
2016 02 04 0002 57422 120 4.48e-09 5.26e-07
2016 02 04 0003 57422 180 3.21e-09 5.17e-07
2016 02 04 0004 57422 240 4.23e-09 5.18e-07
2016 02 04 0005 57422 300 4.55e-09 5.21e-07
2016 02 04 0006 57422 360 3.30e-09 5.31e-07
2016 02 04 0007 57422 420 5.29e-09 5.49e-07
2016 02 04 0008 57422 480 3.14e-09 5.65e-07
2016 02 04 0009 57422 540 6.59e-09 5.70e-07
2016 02 04 0010 57422 600 6.04e-09 5.62e-07
2016 02 04 0011 57422 660 5.31e-09 5.62e-07
2016 02 04 0012 57422 720 6.04e-09 5.46e-07
2016 02 04 0013 57422 780 6.81e-09 5.51e-07
2016 02 04 0014 57422 840 6.59e-09 5.65e-07
2016 02 04 0015 57422 900 5.81e-09 5.62e-07
2016 02 04 0016 57422 960 4.63e-09 5.59e-07
2016 02 04 0017 57422 1020 3.05e-09 5.51e-07
2016 02 04 0018 57422 1080 3.26e-09 5.46e-07
2016 02 04 0019 57422 1140 4.59e-09 5.50e-07
2016 02 04 0020 57422 1200 3.38e-09 5.39e-07
2016 02 04 0021 57422 1260 2.43e-09 5.37e-07
2016 02 04 0022 57422 1320 5.31e-09 5.60e-07
2016 02 04 0023 57422 1380 5.63e-09 5.51e-07
2016 02 04 0024 57422 1440 5.18e-09 5.50e-07
2016 02 04 0025 57422 1500 7.06e-09 5.59e-07
2016 02 04 0026 57422 1560 5.01e-09 5.76e-07
2016 02 04 0027 57422 1620 7.17e-09 5.63e-07
2016 02 04 0028 57422 1680 5.74e-09 5.58e-07
2016 02 04 0029 57422 1740 5.55e-09 5.62e-07
2016 02 04 0030 57422 1800 4.99e-09 5.47e-07
2016 02 04 0031 57422 1860 5.49e-09 5.42e-07
2016 02 04 0032 57422 1920 2.14e-09 5.32e-07
2016 02 04 0033 57422 1980 2.48e-09 5.21e-07
2016 02 04 0034 57422 2040 4.35e-09 5.18e-07
2016 02 04 0035 57422 2100 4.84e-09 5.13e-07
2016 02 04 0036 57422 2160 3.12e-09 5.05e-07
2016 02 04 0037 57422 2220 1.18e-09 4.99e-07
2016 02 04 0038 57422 2280 1.59e-09 4.95e-07
Now I need to create a Pandas DataFrame with three named columns: the first is the time (YYYY MM DD HHMM), the second is xra (the penultimate column), and the third is xrb (the last column). Then I need to find the maximum of xrb together with its time. I think I know how to find the max and its index with pandas, but I don't know how to create the DataFrame: the 'header' runs to the 19th line, so I need to build the DataFrame from the two files without the header, using only the data. Also, is there a method to read data from one time to another (a time range)?
Thanks for the help.
Edit:
I have this script:
import urllib2
import datetime
import pandas as pd

# 'date' is a YYYYMMDD string; e.g. use today's date:
date = datetime.datetime.utcnow().strftime('%Y%m%d')

# download the 1-minute X-ray flux list and save it locally
xray_flux = urllib2.urlopen('ftp://ftp.swpc.noaa.gov/pub/lists/xray/' + date + '_Gp_xr_1m.txt')
flux = xray_flux.read()

dataflux = open('xray_flux.txt', 'w')
dataflux.write(flux)
dataflux.close()

# skip the 19 header lines; the data itself has no header row
a = pd.read_csv("xray_flux.txt", header=None, sep=" ", error_bad_lines=False, skiprows=19)
print a

df = pd.DataFrame(a)  # note: read_csv already returns a DataFrame
print df
You can try read_csv and concat:
import io
import pandas as pd

dateparse = lambda x: pd.datetime.strptime(x, '%Y %m %d %H%M')

# temp1 holds the contents of the first file as a string;
# after testing, replace io.StringIO(temp1) with the filename or URL, e.g.:
# df1 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt', ...)
df1 = pd.read_csv(io.StringIO(temp1),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse)
print df1.head()
             datetime      4     5             6             7
0 2016-02-03 00:00:00  57421     0  2.130000e-09  4.600000e-07
1 2016-02-03 00:01:00  57421    60  1.840000e-09  4.510000e-07
2 2016-02-03 00:02:00  57421   120  1.790000e-09  4.520000e-07
3 2016-02-03 00:03:00  57421   180  1.580000e-09  4.580000e-07
4 2016-02-03 00:04:00  57421   240  2.510000e-09  4.560000e-07
# temp holds the contents of the second file as a string;
# after testing, replace io.StringIO(temp) with the filename or URL, e.g.:
# df2 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt', ...)
df2 = pd.read_csv(io.StringIO(temp),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse)
print df2.head()
             datetime      4     5             6             7
0 2016-02-04 00:00:00  57422     0  4.850000e-09  5.280000e-07
1 2016-02-04 00:01:00  57422    60  3.070000e-09  5.290000e-07
2 2016-02-04 00:02:00  57422   120  4.480000e-09  5.260000e-07
3 2016-02-04 00:03:00  57422   180  3.210000e-09  5.170000e-07
4 2016-02-04 00:04:00  57422   240  4.230000e-09  5.180000e-07
df = pd.concat([df1[['datetime',6,7]],df2[['datetime',6,7]]])
df.columns = ['datetime','xra','xrb']
print df.head(10)
             datetime           xra           xrb
0 2016-02-03 00:00:00  2.130000e-09  4.600000e-07
1 2016-02-03 00:01:00  1.840000e-09  4.510000e-07
2 2016-02-03 00:02:00  1.790000e-09  4.520000e-07
3 2016-02-03 00:03:00  1.580000e-09  4.580000e-07
4 2016-02-03 00:04:00  2.510000e-09  4.560000e-07
5 2016-02-03 00:05:00  4.300000e-09  4.480000e-07
6 2016-02-03 00:06:00  1.970000e-09  4.470000e-07
7 2016-02-03 00:07:00  2.460000e-09  4.470000e-07
8 2016-02-03 00:08:00  3.100000e-09  4.510000e-07
9 2016-02-03 00:09:00  3.240000e-09  4.430000e-07
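To get the maximum of xrb together with its time, you can use idxmax. A minimal sketch using the df above; the index is reset first because concat leaves duplicate labels:

df = df.reset_index(drop=True)

# row with the largest xrb value - prints its datetime, xra and xrb
print df.loc[df['xrb'].idxmax()]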
EDIT:
You can also use the usecols parameter of read_csv to filter columns, since you only need the datetime column and columns 6 and 7. Then you can pass df1 and df2 whole to pd.concat:
# df1 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt', ...)
df1 = pd.read_csv(io.StringIO(temp1),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse,
                  usecols=[0, 1, 2, 3, 6, 7])
print df1.head()
             datetime             6             7
0 2016-02-03 00:00:00  2.130000e-09  4.600000e-07
1 2016-02-03 00:01:00  1.840000e-09  4.510000e-07
2 2016-02-03 00:02:00  1.790000e-09  4.520000e-07
3 2016-02-03 00:03:00  1.580000e-09  4.580000e-07
# df2 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt', ...)
df2 = pd.read_csv(io.StringIO(temp),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse,
                  usecols=[0, 1, 2, 3, 6, 7])
print df2.head()
             datetime             6             7
0 2016-02-04 00:00:00  4.850000e-09  5.280000e-07
1 2016-02-04 00:01:00  3.070000e-09  5.290000e-07
2 2016-02-04 00:02:00  4.480000e-09  5.260000e-07
3 2016-02-04 00:03:00  3.210000e-09  5.170000e-07
4 2016-02-04 00:04:00  4.230000e-09  5.180000e-07
df = pd.concat([df1,df2])
df.columns = ['datetime','xra','xrb']
print df
             datetime           xra           xrb
0 2016-02-03 00:00:00  2.130000e-09  4.600000e-07
1 2016-02-03 00:01:00  1.840000e-09  4.510000e-07
2 2016-02-03 00:02:00  1.790000e-09  4.520000e-07
3 2016-02-03 00:03:00  1.580000e-09  4.580000e-07
4 2016-02-03 00:04:00  2.510000e-09  4.560000e-07
5 2016-02-03 00:05:00  4.300000e-09  4.480000e-07
6 2016-02-03 00:06:00  1.970000e-09  4.470000e-07
7 2016-02-03 00:07:00  2.460000e-09  4.470000e-07
8 2016-02-03 00:08:00  3.100000e-09  4.510000e-07
9 2016-02-03 00:09:00  3.240000e-09  4.430000e-07
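And for reading data from some time to some time, one option is to set datetime as the index and slice it with date strings. A sketch using the df above; the two timestamps are just example values:

# make datetime the index and sort it, then slice between two timestamps
df = df.set_index('datetime').sort_index()
print df.loc['2016-02-03 12:00':'2016-02-04 00:30']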