Adding missing rows of pandas DataFrame when index contains duplicate data - python

I have a DataFrame with dtype=object as:
YY MM DD hh var1 var2
.
.
.
10512 2013 01 01 06 1.64 4.64
10513 2013 01 01 07 1.57 4.63
10514 2013 01 01 08 1.56 4.71
10515 2013 01 01 09 1.45 4.69
10516 2013 01 01 10 1.53 4.67
10517 2013 01 01 11 1.31 4.63
10518 2013 01 01 12 1.41 4.70
10519 2013 01 01 13 1.49 4.80
10520 2013 01 01 20 1.15 4.91
10521 2013 01 01 21 1.14 4.74
10522 2013 01 01 22 1.10 4.95
As seen, there are missing rows corresponding to hours (hh); for instance, between rows 10519 and 10520, hh jumps from 13 to 20. I tried to fill the gap by setting hh as the index, as discussed here: Missing data, insert rows in Pandas and fill with NAN
df=df.set_index('hh')
new_index = pd.Index(np.arange(0,24), name="hh")
df=df.reindex(new_index).reset_index()
and reach something like:
YY MM DD hh var1 var2
10519 2013 01 01 13 1.49 4.80
10520 2013 01 01 14 NaN NaN
10521 2013 01 01 15 NaN NaN
10522 2013 01 01 16 NaN NaN
...
10523 2013 01 01 20 1.15 4.91
10524 2013 01 01 21 1.14 4.74
10525 2013 01 01 22 1.10 4.95
But I encounter the error "cannot reindex from a duplicate axis" at the df=df.reindex(new_index) step.
There are duplicate values for each hh=0,1,...,23, because the same value of hh repeats across different months (MM) and years (YY).
That is probably the reason. How can I solve the problem?
In general, how can one fill the missing rows of a pandas DataFrame when the index contains duplicate data? I appreciate any comments.

First create a new column with the time, including date and hour, of type datetime. One way this can be done is as follows:
df = df.rename(columns={'YY': 'year', 'MM': 'month', 'DD': 'day', 'hh': 'hour'})
df['time'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
To use to_datetime in this way, the column names need to be year, month, day and hour, which is why rename is used first.
To get the expected result, set this new column as the index and use resample:
df.set_index('time').resample('H').mean()
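Since the original frame is dtype=object, the value columns may need an explicit numeric conversion first, or the mean will fail or silently drop them. A minimal sketch of the final step, assuming the frame from the question:
# var1/var2 are strings in the question's frame; coerce them before averaging
df[['var1', 'var2']] = df[['var1', 'var2']].apply(pd.to_numeric)
out = df.set_index('time')[['var1', 'var2']].resample('H').mean().reset_index()
# note: pandas >= 2.2 prefers the lowercase alias 'h' over 'H'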

This code does exactly what you need.
import pandas as pd
import numpy as np
from io import StringIO

YY, MM, DD, hh, var1, var2 = [], [], [], [], [], []
a = '''10512 2013 01 01 06 1.64 4.64
10513 2013 01 01 07 1.57 4.63
10514 2013 01 01 08 1.56 4.71
10515 2013 01 01 09 1.45 4.69
10516 2013 01 01 10 1.53 4.67
10517 2013 01 01 11 1.31 4.63
10518 2013 01 01 12 1.41 4.70
10519 2013 01 01 13 1.49 4.80
10520 2013 01 01 20 1.15 4.91
10521 2013 01 01 21 1.14 4.74
10522 2013 01 01 22 1.10 4.95
10523 2013 01 01 27 1.30 4.55
10524 2013 01 01 28 1.2 4.62
'''
text = StringIO(a)
for line in text.readlines():
    a = line.strip().split(" ")
    a = list(filter(None, a))      # drop empty strings left by repeated spaces
    YY.append(a[1])
    MM.append(a[2])
    DD.append(a[3])
    hh.append(a[4])
    var1.append(a[5])
    var2.append(a[6])
df = pd.DataFrame({'YY': YY, 'MM': MM, 'DD': DD,
                   'hh': hh, 'var1': var1, 'var2': var2})
df['hh'] = df.hh.astype(int)
a = np.diff(df.hh)                 # hour step between consecutive rows
b = np.where(a != 1)               # positions where hours are skipped
df2 = df.copy(deep=True)
# walk backwards so the gap positions in b stay valid while df grows
for i in reversed(range(len(df2))):
    if i in b[0]:
        line = pd.DataFrame(columns=['YY', 'MM', 'DD',
                                     'hh', 'var1', 'var2'])
        for k in range(a[i] - 1):  # one filler row per missing hour
            line.loc[k] = [df2.iloc[i, 0], df2.iloc[i, 1],
                           df2.iloc[i, 2], df2.iloc[i, 3] + k + 1,
                           np.nan, np.nan]
        # splice the filler rows between row i and row i+1
        df = pd.concat([df.loc[:i], line, df.loc[i+1:]])
        df.reset_index(inplace=True, drop=True)
print(df)
YY MM DD hh var1 var2
0 2013 01 01 6 1.64 4.64
1 2013 01 01 7 1.57 4.63
2 2013 01 01 8 1.56 4.71
3 2013 01 01 9 1.45 4.69
4 2013 01 01 10 1.53 4.67
5 2013 01 01 11 1.31 4.63
6 2013 01 01 12 1.41 4.70
7 2013 01 01 13 1.49 4.80
8 2013 01 01 14 NaN NaN
9 2013 01 01 15 NaN NaN
10 2013 01 01 16 NaN NaN
11 2013 01 01 17 NaN NaN
12 2013 01 01 18 NaN NaN
13 2013 01 01 19 NaN NaN
14 2013 01 01 20 1.15 4.91
15 2013 01 01 21 1.14 4.74
16 2013 01 01 22 1.10 4.95
17 2013 01 01 23 NaN NaN
18 2013 01 01 24 NaN NaN
19 2013 01 01 25 NaN NaN
20 2013 01 01 26 NaN NaN
21 2013 01 01 27 1.30 4.55
22 2013 01 01 28 1.2 4.62

Pandas groupby: get max value in a subgroup [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 7 months ago.
I have a large dataset grouped by column, row, year, potveg, and total. I am trying to get the max value of the 'total' column for each year within each group, i.e., for the dataset below:
col row year potveg total
-125.0 42.5 2015 9 697.3
2015 13 535.2
2015 15 82.3
2016 9 907.8
2016 13 137.6
2016 15 268.4
2017 9 961.9
2017 13 74.2
2017 15 248.0
2018 9 937.9
2018 13 575.6
2018 15 215.5
-135.0 70.5 2015 8 697.3
2015 10 535.2
2015 19 82.3
2016 8 907.8
2016 10 137.6
2016 19 268.4
2017 8 961.9
2017 10 74.2
2017 19 248.0
2018 8 937.9
2018 10 575.6
2018 19 215.5
I would like the output to look like this:
col row year potveg total
-125.0 42.5 2015 9 697.3
2016 9 907.8
2017 9 961.9
2018 9 937.9
-135.0 70.5 2015 8 697.3
2016 8 907.8
2017 8 961.9
2018 8 937.9
I tried this:
df.groupby(['col', 'row', 'year', 'potveg']).agg({'total': 'max'})
and this:
df.groupby(['col', 'row', 'year', 'potveg'])['total'].max()
but they do not seem to work, because the output has too many rows.
I think the issue is the 'potveg' column, which is a subgroup; I am not sure how to select the rows containing the max value of 'total'.
One possible solution, using .idxmax() inside groupby.apply:
print(
    df.groupby(["col", "row", "year"], as_index=False, sort=False).apply(
        lambda x: x.loc[x["total"].idxmax()]
    )
)
Prints:
col row year potveg total
0 -125.0 42.5 2015.0 9.0 697.3
1 -125.0 42.5 2016.0 9.0 907.8
2 -125.0 42.5 2017.0 9.0 961.9
3 -125.0 42.5 2018.0 9.0 937.9
4 -135.0 70.5 2015.0 8.0 697.3
5 -135.0 70.5 2016.0 8.0 907.8
6 -135.0 70.5 2017.0 8.0 961.9
7 -135.0 70.5 2018.0 8.0 937.9
DataFrame used:
col row year potveg total
0 -125.0 42.5 2015 9 697.3
1 -125.0 42.5 2015 13 535.2
2 -125.0 42.5 2015 15 82.3
3 -125.0 42.5 2016 9 907.8
4 -125.0 42.5 2016 13 137.6
5 -125.0 42.5 2016 15 268.4
6 -125.0 42.5 2017 9 961.9
7 -125.0 42.5 2017 13 74.2
8 -125.0 42.5 2017 15 248.0
9 -125.0 42.5 2018 9 937.9
10 -125.0 42.5 2018 13 575.6
11 -125.0 42.5 2018 15 215.5
12 -135.0 70.5 2015 8 697.3
13 -135.0 70.5 2015 10 535.2
14 -135.0 70.5 2015 19 82.3
15 -135.0 70.5 2016 8 907.8
16 -135.0 70.5 2016 10 137.6
17 -135.0 70.5 2016 19 268.4
18 -135.0 70.5 2017 8 961.9
19 -135.0 70.5 2017 10 74.2
20 -135.0 70.5 2017 19 248.0
21 -135.0 70.5 2018 8 937.9
22 -135.0 70.5 2018 10 575.6
23 -135.0 70.5 2018 19 215.5
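A more concise variant of the same idea (a sketch, assuming a unique index as in the frame above) selects the idxmax rows directly, without apply:
out = df.loc[df.groupby(["col", "row", "year"])["total"].idxmax()]
This also keeps the original dtypes, whereas groupby.apply can upcast, as the floats in the printed result above show.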
Option 1: One way is to do the groupby() and then merge with the original df:
df1 = pd.merge(df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
               df,
               on=['col', 'row', 'year', 'total'])
print(df1)
Output:
col row year total potveg
0 -125.0 42.5 2015 697.3 9
1 -125.0 42.5 2016 907.8 9
2 -125.0 42.5 2017 961.9 9
3 -125.0 42.5 2018 937.9 9
4 -135.0 70.5 2015 697.3 8
5 -135.0 70.5 2016 907.8 8
6 -135.0 70.5 2017 961.9 8
7 -135.0 70.5 2018 937.9 8
Option 2: Or use sort_values() and drop_duplicates(), sorting 'total' in descending order so the max row of each group comes first (sorting by the group keys alone would just keep whichever row happened to appear first):
df1 = df.sort_values(['col', 'row', 'year', 'total'], ascending=[True, True, True, False]).drop_duplicates(['col', 'row', 'year'], keep='first')
print(df1)
Output:
col row year potveg total
0 -125.0 42.5 2015 9 697.3
3 -125.0 42.5 2016 9 907.8
6 -125.0 42.5 2017 9 961.9
9 -125.0 42.5 2018 9 937.9
12 -135.0 70.5 2015 8 697.3
15 -135.0 70.5 2016 8 907.8
18 -135.0 70.5 2017 8 961.9
21 -135.0 70.5 2018 8 937.9

How to groupby multiple columns and unstack get percentage of each cell by dividing from row total in Python?

My question is as follows: I have a data set of ~700 MB which looks like this:
rpt_period_name_week period_name_mth assigned_date_utc resolved_date_utc handle_seconds action marketplace_id login category currency_code order_amount_in_usd day_of_week_NewClmn
2020 Week 01 2020 / 01 1/11/2020 23:58 1/11/2020 23:59 84 Pass DE a MRI AT EUR 81.32 Saturday
2020 Week 02 2020 / 01 1/11/2020 23:58 1/11/2020 23:59 37 Pass DE b MRI AQ EUR 222.38 Saturday
2020 Week 01 2020 / 01 1/11/2020 23:57 1/11/2020 23:59 123 Pass DE a MRI DG EUR 444.77 Saturday
2020 Week 02 2020 / 01 1/11/2020 23:54 1/11/2020 23:59 313 Hold JP a MRI AQ Saturday
2020 Week 01 2020 / 01 1/11/2020 23:57 1/11/2020 23:59 112 Pass FR b MRI DG EUR 582.53 Saturday
2020 Week 02 2020 / 01 1/11/2020 23:54 1/11/2020 23:58 249 Pass DE f MRI AT EUR 443.16 Saturday
2020 Week 03 2020 / 01 1/11/2020 23:58 1/11/2020 23:58 48 Pass DE b MRI DG EUR 20.5 Saturday
2020 Week 03 2020 / 01 1/11/2020 23:57 1/11/2020 23:58 40 Pass IT a MRI AQ EUR 272.01 Saturday
My desired output is like this:
https://i.stack.imgur.com/8oz7G.png
My code is below, but I am unable to get the desired result: my cells are getting divided by the sum of the whole row. I have tried multiple options, in vain.
df = data_final.groupby(['login','category','rpt_period_name_week','action'])['action'].agg(np.count_nonzero).unstack(['rpt_period_name_week','action']).apply(lambda x: x.fillna(0))
df = df.div(df.sum(1), 0).mul(100).round(2).assign(Total=lambda df: df.sum(axis=1))
# df = df.div(df.sum(1), 0).mul(100).round(2).assign(Total=lambda df: df.sum(axis=1))
df1 = df.astype(str) + '%'
# print (df1)
Please help.
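One likely cause (a guess, since the desired output is only an image): df.div(df.sum(1), 0) divides each cell by the total of the entire row, i.e. across all weeks at once. A sketch that instead normalizes within each week's columns, assuming the two-level column index (rpt_period_name_week, action) produced by the unstack above:
# per-row totals within each week level, broadcast back to the cell shape
week_totals = df.T.groupby(level='rpt_period_name_week').transform('sum').T
df_pct = df.div(week_totals).mul(100).round(2)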

Transpose multiple rows of data in panda df [duplicate]

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 3 years ago.
I have the below table which shows rainfall by month in the UK across a number of years. I want to transpose it so that each row is one month/year and the data is chronological.
Year JAN FEB MAR APR
2010 79.7 74.8 79.4 48
2011 102.8 114.5 49.7 36.7
2012 110.9 60 37 128
2013 110.5 59.8 64.6 63.6
I would like it so the table looks like the below with year, month & rainfall as the columns:
2010 JAN 79.7
2010 FEB 74.8
2010 MAR 79.4
2010 APR 48
2011 JAN 102.8
2011 FEB 114.5
I think I need to use a for loop and iterate through each row to create a new dataframe, but I'm not sure of the syntax. I've tried the loop below, which nearly does what I want but doesn't produce a dataframe.
for index, row in weather.iterrows():
    print(row["Year"], row)
2014.0 Year 2014.0
JAN 188.0
FEB 169.2
MAR 80.0
APR 67.8
MAY 99.6
JUN 54.8
JUL 64.7
Any help would be appreciated.
You should avoid using for-loops and instead use stack.
df.set_index('Year') \
.stack() \
.reset_index() \
.rename(columns={'level_1': 'Month', 0: 'Amount'})
Year Month Amount
0 2010 JAN 79.7
1 2010 FEB 74.8
2 2010 MAR 79.4
3 2010 APR 48.0
4 2011 JAN 102.8
5 2011 FEB 114.5
6 2011 MAR 49.7
7 2011 APR 36.7
8 2012 JAN 110.9
9 2012 FEB 60.0
etc...
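For reference, melt does the same reshape in one call; a sketch assuming the same frame:
out = (df.melt(id_vars='Year', var_name='Month', value_name='Amount')
         .sort_values('Year', kind='mergesort')  # stable sort keeps JAN..APR order within each year
         .reset_index(drop=True))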

Python: Grouping by date and finding the average of a column inside a dataframe

I have a data frame that has 3 columns.
Time represents every day of the month for various months. What I am trying to do is take the 'Count' value per day and average it for each month, doing this for each country. The output must be in the form of a data frame.
Current data:
Time Country Count
2017-01-01 us 7827
2017-01-02 us 7748
2017-01-03 us 7653
..
..
2017-01-30 us 5432
2017-01-31 us 2942
2017-01-01 us 5829
2017-01-02 ca 9843
2017-01-03 ca 7845
..
..
2017-01-30 ca 8654
2017-01-31 ca 8534
Desire output (dummy data, numbers are not representative of the DF above):
Time Country Monthly Average
Jan 2017 us 6873
Feb 2017 us 8875
..
..
Nov 2017 us 9614
Dec 2017 us 2475
Jan 2017 ca 1878
Feb 2017 ca 4775
..
..
Nov 2017 ca 7643
Dec 2017 ca 9441
I'd organize it like this:
df.groupby(
    [df.Time.dt.strftime('%b %Y'), 'Country']
)['Count'].mean().reset_index(name='Monthly Average')
Time Country Monthly Average
0 Feb 2017 ca 88.0
1 Feb 2017 us 105.0
2 Jan 2017 ca 85.0
3 Jan 2017 us 24.6
4 Mar 2017 ca 86.0
5 Mar 2017 us 54.0
If your 'Time' column wasn't already a datetime column, I'd do this:
df.groupby(
    [pd.to_datetime(df.Time).dt.strftime('%b %Y'), 'Country']
)['Count'].mean().reset_index(name='Monthly Average')
Time Country Monthly Average
0 Feb 2017 ca 88.0
1 Feb 2017 us 105.0
2 Jan 2017 ca 85.0
3 Jan 2017 us 24.6
4 Mar 2017 ca 86.0
5 Mar 2017 us 54.0
Use pandas dt.strftime to create the month-year column you want, then groupby + mean. I used this dataframe:
Dated country num
2017-01-01 us 12
2017-01-02 us 12
2017-02-02 us 134
2017-02-03 us 76
2017-03-30 us 54
2017-01-31 us 29
2017-01-01 us 58
2017-01-02 us 12
2017-02-02 ca 98
2017-02-03 ca 78
2017-03-30 ca 86
2017-01-31 ca 85
Then create a Month-Year column:
a['MonthYear']= a.Dated.dt.strftime('%b %Y')
Then, drop the Dated column and aggregate by mean:
a.drop('Dated', axis=1).groupby(['MonthYear','country']).mean().rename(columns={'num':'Averaged'}).reset_index()
MonthYear country Averaged
Feb 2017 ca 88.0
Feb 2017 us 105.0
Jan 2017 ca 85.0
Jan 2017 us 24.6
Mar 2017 ca 86.0
Mar 2017 us 54.0
I retained the Dated column just in case.
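One caveat with '%b %Y' strings: they sort alphabetically, which is why Feb 2017 appears before Jan 2017 in both outputs above. A sketch (using the question's Time/Count column names) that keeps chronological order by grouping on a monthly Period instead:
out = (df.groupby([df.Time.dt.to_period('M'), 'Country'])['Count']
         .mean()
         .reset_index(name='Monthly Average'))
# out['Time'] is a Period (e.g. 2017-01) that sorts chronologically;
# format it at the end if the 'Jan 2017' label is needed:
out['Time'] = out['Time'].dt.strftime('%b %Y')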

pandas create dataframe from two files

I have two txt files... The structure of the first one is:
:Data_list: 20160203_Gs_xr_1m.txt
:Created: 2016 Feb 04 0010 UTC
# Prepared by the U.S. Dept. of Commerce, NOAA, Space Weather Prediction Center
# Please send comments and suggestions to SWPC.Webmaster@noaa.gov
#
# Label: Short = 0.05- 0.4 nanometer
# Label: Long = 0.1 - 0.8 nanometer
# Units: Short = Watts per meter squared
# Units: Long = Watts per meter squared
# Source: GOES-13
# Location: W075
# Missing data: -1.00e+05
#
# 1-minute GOES-13 Solar X-ray Flux
#
# Modified Seconds
# UTC Date Time Julian of the
# YR MO DA HHMM Day Day Short Long
#-------------------------------------------------------
2016 02 03 0000 57421 0 2.13e-09 4.60e-07
2016 02 03 0001 57421 60 1.84e-09 4.51e-07
2016 02 03 0002 57421 120 1.79e-09 4.52e-07
2016 02 03 0003 57421 180 1.58e-09 4.58e-07
2016 02 03 0004 57421 240 2.51e-09 4.56e-07
2016 02 03 0005 57421 300 4.30e-09 4.48e-07
2016 02 03 0006 57421 360 1.97e-09 4.47e-07
2016 02 03 0007 57421 420 2.46e-09 4.47e-07
2016 02 03 0008 57421 480 3.10e-09 4.51e-07
2016 02 03 0009 57421 540 3.24e-09 4.43e-07
2016 02 03 0010 57421 600 2.92e-09 4.34e-07
2016 02 03 0011 57421 660 2.42e-09 4.35e-07
2016 02 03 0012 57421 720 2.90e-09 4.40e-07
2016 02 03 0013 57421 780 1.87e-09 4.36e-07
2016 02 03 0014 57421 840 1.31e-09 4.37e-07
2016 02 03 0015 57421 900 2.50e-09 4.41e-07
2016 02 03 0016 57421 960 1.52e-09 4.42e-07
2016 02 03 0017 57421 1020 1.36e-09 4.41e-07
2016 02 03 0018 57421 1080 1.33e-09 4.35e-07
2016 02 03 0019 57421 1140 2.20e-09 4.37e-07
2016 02 03 0020 57421 1200 2.90e-09 4.53e-07
2016 02 03 0021 57421 1260 1.39e-09 4.75e-07
2016 02 03 0022 57421 1320 4.55e-09 4.67e-07
2016 02 03 0023 57421 1380 2.30e-09 4.58e-07
2016 02 03 0024 57421 1440 3.99e-09 4.53e-07
2016 02 03 0025 57421 1500 3.93e-09 4.40e-07
2016 02 03 0026 57421 1560 1.70e-09 4.34e-07
.
.
.
2016 02 03 2344 57421 85440 3.77e-09 5.00e-07
2016 02 03 2345 57421 85500 3.76e-09 4.96e-07
2016 02 03 2346 57421 85560 1.64e-09 4.97e-07
2016 02 03 2347 57421 85620 3.59e-09 5.04e-07
2016 02 03 2348 57421 85680 2.55e-09 5.04e-07
2016 02 03 2349 57421 85740 2.30e-09 5.11e-07
2016 02 03 2350 57421 85800 2.95e-09 5.09e-07
2016 02 03 2351 57421 85860 4.25e-09 5.02e-07
2016 02 03 2352 57421 85920 4.78e-09 5.02e-07
2016 02 03 2353 57421 85980 3.04e-09 5.01e-07
2016 02 03 2354 57421 86040 3.30e-09 5.10e-07
2016 02 03 2355 57421 86100 2.22e-09 5.16e-07
2016 02 03 2356 57421 86160 4.12e-09 5.15e-07
2016 02 03 2357 57421 86220 4.25e-09 5.16e-07
2016 02 03 2358 57421 86280 3.48e-09 5.20e-07
2016 02 03 2359 57421 86340 4.19e-09 5.27e-07
And the second one:
:Data_list: 20160204_Gs_xr_1m.txt
:Created: 2016 Feb 04 1301 UTC
# Prepared by the U.S. Dept. of Commerce, NOAA, Space Weather Prediction Center
# Please send comments and suggestions to SWPC.Webmaster@noaa.gov
#
# Label: Short = 0.05- 0.4 nanometer
# Label: Long = 0.1 - 0.8 nanometer
# Units: Short = Watts per meter squared
# Units: Long = Watts per meter squared
# Source: GOES-13
# Location: W075
# Missing data: -1.00e+05
#
# 1-minute GOES-13 Solar X-ray Flux
#
# Modified Seconds
# UTC Date Time Julian of the
# YR MO DA HHMM Day Day Short Long
#-------------------------------------------------------
2016 02 04 0000 57422 0 4.85e-09 5.28e-07
2016 02 04 0001 57422 60 3.07e-09 5.29e-07
2016 02 04 0002 57422 120 4.48e-09 5.26e-07
2016 02 04 0003 57422 180 3.21e-09 5.17e-07
2016 02 04 0004 57422 240 4.23e-09 5.18e-07
2016 02 04 0005 57422 300 4.55e-09 5.21e-07
2016 02 04 0006 57422 360 3.30e-09 5.31e-07
2016 02 04 0007 57422 420 5.29e-09 5.49e-07
2016 02 04 0008 57422 480 3.14e-09 5.65e-07
2016 02 04 0009 57422 540 6.59e-09 5.70e-07
2016 02 04 0010 57422 600 6.04e-09 5.62e-07
2016 02 04 0011 57422 660 5.31e-09 5.62e-07
2016 02 04 0012 57422 720 6.04e-09 5.46e-07
2016 02 04 0013 57422 780 6.81e-09 5.51e-07
2016 02 04 0014 57422 840 6.59e-09 5.65e-07
2016 02 04 0015 57422 900 5.81e-09 5.62e-07
2016 02 04 0016 57422 960 4.63e-09 5.59e-07
2016 02 04 0017 57422 1020 3.05e-09 5.51e-07
2016 02 04 0018 57422 1080 3.26e-09 5.46e-07
2016 02 04 0019 57422 1140 4.59e-09 5.50e-07
2016 02 04 0020 57422 1200 3.38e-09 5.39e-07
2016 02 04 0021 57422 1260 2.43e-09 5.37e-07
2016 02 04 0022 57422 1320 5.31e-09 5.60e-07
2016 02 04 0023 57422 1380 5.63e-09 5.51e-07
2016 02 04 0024 57422 1440 5.18e-09 5.50e-07
2016 02 04 0025 57422 1500 7.06e-09 5.59e-07
2016 02 04 0026 57422 1560 5.01e-09 5.76e-07
2016 02 04 0027 57422 1620 7.17e-09 5.63e-07
2016 02 04 0028 57422 1680 5.74e-09 5.58e-07
2016 02 04 0029 57422 1740 5.55e-09 5.62e-07
2016 02 04 0030 57422 1800 4.99e-09 5.47e-07
2016 02 04 0031 57422 1860 5.49e-09 5.42e-07
2016 02 04 0032 57422 1920 2.14e-09 5.32e-07
2016 02 04 0033 57422 1980 2.48e-09 5.21e-07
2016 02 04 0034 57422 2040 4.35e-09 5.18e-07
2016 02 04 0035 57422 2100 4.84e-09 5.13e-07
2016 02 04 0036 57422 2160 3.12e-09 5.05e-07
2016 02 04 0037 57422 2220 1.18e-09 4.99e-07
2016 02 04 0038 57422 2280 1.59e-09 4.95e-07
Now I need to create a pandas DataFrame with three named columns: time (YYYY MM DD HHMM), xra (the penultimate column) and xrb (the last column). Then I need to find the max of xrb together with its time. I think I know how to find the max and its index with pandas, but I don't know how to create the DataFrame. The problem is the header, which runs to the 19th line; I need to build the DataFrame from the two files without the header, only the data. Also, is there a method to read data from some time to some time (a time range)?
Thanks for any help.
Edit:
I have this script:
import urllib2
import sys
import datetime
import pandas as pd
xray_flux = urllib2.urlopen('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt')
flux=xray_flux.read()
dataflux= open('xray_flux.txt','w')
dataflux.write(flux)
dataflux.close()
a=pd.read_csv("xray_flux.txt",header=None, sep=" ",error_bad_lines=False,skiprows=19)
print a
df=pd.DataFrame(a)
print df
You can try read_csv and concat:
import io
import pandas as pd

dateparse = lambda x: pd.datetime.strptime(x, '%Y %m %d %H%M')
#df1 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt',
#after testing, replace io.StringIO(temp1) with the filename
df1 = pd.read_csv(io.StringIO(temp1),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse)
print df1.head()
datetime 4 5 6 7
0 2016-02-03 00:00:00 57421 0 2.130000e-09 4.600000e-07
1 2016-02-03 00:01:00 57421 60 1.840000e-09 4.510000e-07
2 2016-02-03 00:02:00 57421 120 1.790000e-09 4.520000e-07
3 2016-02-03 00:03:00 57421 180 1.580000e-09 4.580000e-07
4 2016-02-03 00:04:00 57421 240 2.510000e-09 4.560000e-07
#df2 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt',
#after testing, replace io.StringIO(temp) with the filename
df2 = pd.read_csv(io.StringIO(temp),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse)
print df2.head()
datetime 4 5 6 7
0 2016-02-04 00:00:00 57422 0 4.850000e-09 5.280000e-07
1 2016-02-04 00:01:00 57422 60 3.070000e-09 5.290000e-07
2 2016-02-04 00:02:00 57422 120 4.480000e-09 5.260000e-07
3 2016-02-04 00:03:00 57422 180 3.210000e-09 5.170000e-07
4 2016-02-04 00:04:00 57422 240 4.230000e-09 5.180000e-07
df = pd.concat([df1[['datetime',6,7]],df2[['datetime',6,7]]])
df.columns = ['datetime','xra','xrb']
print df.head(10)
datetime xra xrb
0 2016-02-03 00:00:00 2.130000e-09 4.600000e-07
1 2016-02-03 00:01:00 1.840000e-09 4.510000e-07
2 2016-02-03 00:02:00 1.790000e-09 4.520000e-07
3 2016-02-03 00:03:00 1.580000e-09 4.580000e-07
4 2016-02-03 00:04:00 2.510000e-09 4.560000e-07
5 2016-02-03 00:05:00 4.300000e-09 4.480000e-07
6 2016-02-03 00:06:00 1.970000e-09 4.470000e-07
7 2016-02-03 00:07:00 2.460000e-09 4.470000e-07
8 2016-02-03 00:08:00 3.100000e-09 4.510000e-07
9 2016-02-03 00:09:00 3.240000e-09 4.430000e-07
EDIT:
You can also use the usecols parameter of read_csv to filter columns - you only need the four date/time columns plus columns 6 and 7. Then you can pass all of df1 and df2 to pd.concat:
#df1 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt',
df1 = pd.read_csv(io.StringIO(temp1),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse,
                  usecols=[0, 1, 2, 3, 6, 7])
print df1.head()
datetime 6 7
0 2016-02-03 00:00:00 2.130000e-09 4.600000e-07
1 2016-02-03 00:01:00 1.840000e-09 4.510000e-07
2 2016-02-03 00:02:00 1.790000e-09 4.520000e-07
3 2016-02-03 00:03:00 1.580000e-09 4.580000e-07
4 2016-02-03 00:04:00 2.510000e-09 4.560000e-07
#df2 = pd.read_csv('ftp://ftp.swpc.noaa.gov/pub/lists/xray/'+date+'_Gp_xr_1m.txt',
df2 = pd.read_csv(io.StringIO(temp),
                  sep="\s+",
                  index_col=None,
                  skiprows=19,
                  parse_dates={'datetime': [0, 1, 2, 3]},
                  header=None,
                  date_parser=dateparse,
                  usecols=[0, 1, 2, 3, 6, 7])
print df2.head()
datetime 6 7
0 2016-02-04 00:00:00 4.850000e-09 5.280000e-07
1 2016-02-04 00:01:00 3.070000e-09 5.290000e-07
2 2016-02-04 00:02:00 4.480000e-09 5.260000e-07
3 2016-02-04 00:03:00 3.210000e-09 5.170000e-07
4 2016-02-04 00:04:00 4.230000e-09 5.180000e-07
df = pd.concat([df1,df2])
df.columns = ['datetime','xra','xrb']
print df
datetime xra xrb
0 2016-02-03 00:00:00 2.130000e-09 4.600000e-07
1 2016-02-03 00:01:00 1.840000e-09 4.510000e-07
2 2016-02-03 00:02:00 1.790000e-09 4.520000e-07
3 2016-02-03 00:03:00 1.580000e-09 4.580000e-07
4 2016-02-03 00:04:00 2.510000e-09 4.560000e-07
5 2016-02-03 00:05:00 4.300000e-09 4.480000e-07
6 2016-02-03 00:06:00 1.970000e-09 4.470000e-07
7 2016-02-03 00:07:00 2.460000e-09 4.470000e-07
8 2016-02-03 00:08:00 3.100000e-09 4.510000e-07
9 2016-02-03 00:09:00 3.240000e-09 4.430000e-07
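From here the remaining parts of the question follow; a sketch, assuming the concatenated df above:
df = df.set_index('datetime').sort_index()
peak_time = df['xrb'].idxmax()    # time at which xrb is maximal
peak_value = df['xrb'].max()
# label-based slicing on a DatetimeIndex gives the "some time to some time" range
window = df.loc['2016-02-03 12:00':'2016-02-04 00:30']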
