Create files from extracted content of one file - python

I have a large file containing results for each benchmark case and number of processes used. All this information follows one section after another within the same file.
--
# Benchmarking Allgather
# #processes = 8
# ( 3592 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
1 1000 1.77 2.07 1.97
2 1000 1.79 2.08 1.97
4 1000 1.79 2.07 1.98
8 1000 1.82 2.12 2.01
--
# Benchmarking Allgather
# #processes = 16
# ( 3584 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
1 1000 2.34 2.85 2.73
2 1000 2.36 2.87 2.74
4 1000 2.38 2.90 2.76
8 1000 2.42 2.95 2.79
In order to quickly plot the information, I was planning to create one file per independent section. For instance, from the information given above I would create two files called "Allgather_8" and "Allgather_16", with the expected content:
$cat Allgather_8
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
1 1000 1.77 2.07 1.97
2 1000 1.79 2.08 1.97
4 1000 1.79 2.07 1.98
8 1000 1.82 2.12 2.01
$cat Allgather_16
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
1 1000 2.34 2.85 2.73
2 1000 2.36 2.87 2.74
4 1000 2.38 2.90 2.76
8 1000 2.42 2.95 2.79
I could then plot this with gnuplot or matplotlib.
What I have tried so far:
I have been using grep and awk to extract the content, which works for individual sections, but I don't know how to automate it.
Any ideas?

awk '
/Benchmarking/ { close(out); out = $NF }   # new section: take the benchmark name, e.g. "Allgather"
/#processes/   { out = out "_" $NF }       # append the process count, e.g. "Allgather_8"
/^[[:space:]]/ { print > out }             # indented header/data lines go to the current output file
' file
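The question is tagged python, so here is a rough equivalent of the awk script in plain Python. This is only a sketch: the input file name benchmark.out is an assumption, and it keys on the '#bytes' header and the numeric first column of the data rows rather than on indentation.
out = None
name = None
with open('benchmark.out') as f:
    for line in f:
        if 'Benchmarking' in line:
            name = line.split()[-1]                        # e.g. "Allgather"
        elif '#processes' in line and name:
            if out:
                out.close()
            out = open(f"{name}_{line.split()[-1]}", 'w')  # e.g. "Allgather_8"
        elif out:
            stripped = line.strip()
            # keep the header row and the numeric data rows, skip everything else
            if stripped.startswith('#bytes') or stripped[:1].isdigit():
                out.write(line)
if out:
    out.close()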

Related

In dataframe, how to speed up recognizing rows that have more than 5 consecutive previous values with same sign?

I have a dataframe like this.
val consecutive
0 0.0001 0.0
1 0.0008 0.0
2 -0.0001 0.0
3 0.0005 0.0
4 0.0008 0.0
5 0.0002 0.0
6 0.0012 0.0
7 0.0012 1.0
8 0.0007 1.0
9 0.0004 1.0
10 0.0002 1.0
11 0.0000 0.0
12 0.0015 0.0
13 -0.0005 0.0
14 -0.0003 0.0
15 0.0001 0.0
16 0.0001 0.0
17 0.0003 0.0
18 -0.0003 0.0
19 -0.0001 0.0
20 0.0000 0.0
21 0.0000 0.0
22 -0.0008 0.0
23 -0.0008 0.0
24 -0.0001 0.0
25 -0.0006 0.0
26 -0.0010 1.0
27 0.0002 0.0
28 -0.0003 0.0
29 -0.0008 0.0
30 -0.0010 0.0
31 -0.0003 0.0
32 -0.0005 1.0
33 -0.0012 1.0
34 -0.0002 1.0
35 0.0000 0.0
36 -0.0018 0.0
37 -0.0009 0.0
38 -0.0007 0.0
39 0.0000 0.0
40 -0.0011 0.0
41 -0.0006 0.0
42 -0.0010 0.0
43 -0.0015 0.0
44 -0.0012 1.0
45 -0.0011 1.0
46 -0.0010 1.0
47 -0.0014 1.0
48 -0.0011 1.0
49 -0.0017 1.0
50 -0.0015 1.0
51 -0.0010 1.0
52 -0.0014 1.0
53 -0.0012 1.0
54 -0.0004 1.0
55 -0.0007 1.0
56 -0.0011 1.0
57 -0.0008 1.0
58 -0.0006 1.0
59 0.0002 0.0
The column 'consecutive' is what I want to compute. It is '1' when the current row completes a run of 5 or more consecutive values with the same sign (either positive or negative, counting the row itself).
What I've tried is:
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
).replace(np.nan, 0)
But it's too slow for large dataset.
Do you have any idea how to speed up?
One option is to avoid the use of apply() altogether.
The main idea is to create two 'helper' columns:
sign: a boolean Series indicating whether the value is non-negative (True) or negative (False)
id: groups identical consecutive occurrences together
Finally, we can group by the id and use the cumulative count to isolate the rows which have 4 or more previous rows with the same sign (i.e. get all rows that end a run of 5 consecutive same-sign values).
# Setup test dataset
import pandas as pd
import numpy as np
vals = np.random.randn(20000)
df = pd.DataFrame({'val': vals})
# Create the helper columns
sign = df['val'] >= 0
df['id'] = sign.ne(sign.shift()).cumsum()
# Count the ids and set flag to True if the cumcount is above our desired value
df['consecutive'] = df.groupby('id').cumcount() >= 4
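As a quick sanity check (a toy series of my own, not from the original post): the fifth element of a same-sign run is the first one to be flagged.
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, -1.0, 6.0])
sign = s >= 0
ids = sign.ne(sign.shift()).cumsum()
print((ids.groupby(ids).cumcount() >= 4).tolist())
# [False, False, False, False, True, False, False]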
Benchmarking
On my system I get the following benchmarks:
sign = df['val'] >= 0
# 92 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df['id'] = sign.ne(sign.shift()).cumsum()
# 1.06 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df['consecutive'] = df.groupby('id').cumcount() >= 4
# 3.36 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Thus in total we get an average runtime of: 4.51 ms
For reference, your solution and @Emma's solution ran on my system in, respectively:
# 287 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 121 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not sure whether this is fast enough for your data size, but using min and max seems faster: all five values are positive exactly when the minimum is positive, and all negative exactly when the maximum is negative.
With 20k rows,
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
)
# 144 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: (arr.min() > 0 or arr.max() < 0), raw=True
)
# 57.1 ms ± 85.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

How to read webpage dataset in pandas?

I am trying to read this table
on the webpage: https://datahub.io/sports-data/german-bundesliga
I am using this code:
import pandas as pd
url="https://datahub.io/sports-data/german-bundesliga"
pd.read_html(url)[2]
It reads other tables on the page, but not tables of this type.
Also there is a link to this specific table:
https://datahub.io/sports-data/german-bundesliga/r/0.html
I also tried this:
import pandas as pd
url="https://datahub.io/sports-data/german-bundesliga/r/0.html"
pd.read_html(url)
But it says that there are no tables to read.
There is no need to use the HTML form of the table, because the data is also available in CSV format:
pd.read_csv('https://datahub.io/sports-data/german-bundesliga/r/season-1819.csv').head()
output:
Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR ... BbAv<2.5 BbAH BbAHh BbMxAHH BbAvAHH BbMxAHA BbAvAHA PSCH PSCD PSCA
0 D1 24/08/2018 Bayern Munich Hoffenheim 3 1 H 1 0 H ... 3.55 22 -2.00 1.92 1.87 2.05 1.99 1.23 7.15 14.10
1 D1 25/08/2018 Fortuna Dusseldorf Augsburg 1 2 A 1 0 H ... 1.76 20 0.00 1.80 1.76 2.17 2.11 2.74 3.33 2.78
2 D1 25/08/2018 Freiburg Ein Frankfurt 0 2 A 0 1 A ... 1.69 20 -0.25 2.02 1.99 1.92 1.88 2.52 3.30 3.07
3 D1 25/08/2018 Hertha Nurnberg 1 0 H 1 0 H ... 1.76 20 -0.25 1.78 1.74 2.21 2.14 1.79 3.61 5.21
4 D1 25/08/2018 M'gladbach Leverkusen 2 0 H 0 0 D ... 2.32 20 0.00 2.13 2.07 1.84 1.78 2.63 3.70 2.69
5 rows × 61 columns
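The other seasons listed on the page can presumably be read the same way; the file name below just follows the naming scheme of the CSV above, so treat it as an assumption:
import pandas as pd

# assumption: earlier seasons follow the same "season-XXXX.csv" naming scheme
url = 'https://datahub.io/sports-data/german-bundesliga/r/season-1718.csv'
df = pd.read_csv(url)
print(df.shape)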

Select rows from DataReader based on value and transfer to DataFrame

I am doing a project where I read in the historical values for a given stock. I then want to filter the days where the price jumped +5% or -5% into a different dataframe, but I am struggling with the transfer of those rows.
import pandas_datareader as web
import pandas as pd
import datetime
start = datetime.datetime(2015, 9, 1)
end = datetime.datetime(2019, 11, 2)
df1 = pd.DataFrame()
df = web.DataReader("amd", 'yahoo', start, end)
df['Close'] = df['Close'].astype(float)
df['Open'] = df['Open'].astype(float)
for row in df:
    df['perchange'] = ((df['Close']-df['Open'])/df['Open'])*100
    df['perchange'] = df['perchange'].astype(float)
    if df['perchange'] >= 5.0:
        df1 += df
    if ['perchange'] <= -5.0:
        df1 += df
df.to_csv('amd_volume_price_history.csv')
df1.to_csv('amd_5_to_5.csv')
You can do the following to create a new dataframe with the rows where the percentage change is greater than 5% in absolute value. As you can see, Series.between has been used to perform boolean indexing:
not_significant=((df['Close']-df['Open'])/df['Open']).between(-0.05,0.05)
df_filtered=df[~not_significant]
print(df_filtered)
Output
High Low Open Close Volume Adj Close
Date
2015-09-11 2.140000 1.810000 1.880000 2.010000 31010300 2.010000
2015-09-14 2.000000 1.810000 2.000000 1.820000 16458500 1.820000
2015-10-19 2.010000 1.910000 1.910000 2.010000 10670800 2.010000
2015-10-23 2.210000 2.100000 2.100000 2.210000 9564200 2.210000
2015-11-03 2.290000 2.160000 2.160000 2.280000 8705800 2.280000
... ... ... ... ... ... ...
2019-06-06 31.980000 29.840000 29.870001 31.820000 131267800 31.820000
2019-07-31 32.299999 30.299999 32.080002 30.450001 119190000 30.450001
2019-08-08 34.270000 31.480000 31.530001 33.919998 167278800 33.919998
2019-08-12 34.650002 32.080002 34.160000 32.430000 106936000 32.430000
2019-08-23 31.830000 29.400000 31.299999 29.540001 83681100 29.540001
[123 rows x 6 columns]
If you really need the Perchange column, you can create it by changing the code:
df['Perchange']=(df['Close']-df['Open'])/df['Open']*100
not_significant=(df['Perchange']).between(-5,5)
df_filtered=df[~not_significant]
print(df_filtered)
Also you can use DataFrame.pct_change:
df['Perchange']=df[['Open','Close']].pct_change(axis=1).Close*100
Output
High Low Open Close Volume Adj Close \
Date
2015-09-11 2.140000 1.810000 1.880000 2.010000 31010300 2.010000
2015-09-14 2.000000 1.810000 2.000000 1.820000 16458500 1.820000
2015-10-19 2.010000 1.910000 1.910000 2.010000 10670800 2.010000
2015-10-23 2.210000 2.100000 2.100000 2.210000 9564200 2.210000
2015-11-03 2.290000 2.160000 2.160000 2.280000 8705800 2.280000
... ... ... ... ... ... ...
2019-06-06 31.980000 29.840000 29.870001 31.820000 131267800 31.820000
2019-07-31 32.299999 30.299999 32.080002 30.450001 119190000 30.450001
2019-08-08 34.270000 31.480000 31.530001 33.919998 167278800 33.919998
2019-08-12 34.650002 32.080002 34.160000 32.430000 106936000 32.430000
2019-08-23 31.830000 29.400000 31.299999 29.540001 83681100 29.540001
Perchange
Date
2015-09-11 6.914893
2015-09-14 -8.999997
2015-10-19 5.235603
2015-10-23 5.238102
2015-11-03 5.555550
... ...
2019-06-06 6.528285
2019-07-31 -5.081050
2019-08-08 7.580074
2019-08-12 -5.064401
2019-08-23 -5.622998
[123 rows x 7 columns]
Your code would then look like this:
#Libraries
import pandas_datareader as web
import pandas as pd
import datetime
#Getting data
start = datetime.datetime(2015, 9, 1)
end = datetime.datetime(2019, 11, 2)
df = web.DataReader("amd", 'yahoo', start, end)
#Converting to float for calculation and filtering
df['Close'] = df['Close'].astype(float)
df['Open'] = df['Open'].astype(float)
#Creating Perchange column.
df['Perchange']=(df['Close']-df['Open'])/df['Open']*100
#df['Perchange']=df[['Open','Close']].pct_change(axis=1).Close*100
#Filtering
not_significant=(df['Perchange']).between(-5,5)
df_filtered=df[~not_significant]
#Saving data.
df.to_csv('amd_volume_price_history.csv')
df_filtered.to_csv('amd_5_to_5.csv')
EDIT
To keep each jump together with the 4 rows that follow it, you can group by the cumulative count of jumps:
df['Perchange']=(df['Close']-df['Open'])/df['Open']*100
significant=~(df['Perchange']).between(-5,5)
group_by_jump=significant.cumsum()
jump_and_4=group_by_jump.groupby(group_by_jump,sort=False).cumcount().le(4)&group_by_jump.ne(0)
df_filtered=df[jump_and_4]
print(df_filtered.head(50))
High Low Open Close Volume Adj Close Perchange
Date
2015-09-11 2.14 1.81 1.88 2.01 31010300 2.01 6.914893
2015-09-14 2.00 1.81 2.00 1.82 16458500 1.82 -8.999997
2015-09-15 1.87 1.81 1.84 1.86 6524400 1.86 1.086955
2015-09-16 1.90 1.85 1.87 1.89 4928300 1.89 1.069518
2015-09-17 1.94 1.87 1.90 1.89 5831600 1.89 -0.526315
2015-09-18 1.92 1.85 1.87 1.87 11814000 1.87 0.000000
2015-10-19 2.01 1.91 1.91 2.01 10670800 2.01 5.235603
2015-10-20 2.03 1.97 2.00 2.02 5584200 2.02 0.999999
2015-10-21 2.12 2.01 2.02 2.10 14944100 2.10 3.960392
2015-10-22 2.16 2.09 2.10 2.14 8208400 2.14 1.904772
2015-10-23 2.21 2.10 2.10 2.21 9564200 2.21 5.238102
2015-10-26 2.21 2.12 2.21 2.15 6313500 2.15 -2.714929
2015-10-27 2.16 2.10 2.12 2.15 5755600 2.15 1.415104
2015-10-28 2.20 2.12 2.14 2.18 6950600 2.18 1.869157
2015-10-29 2.18 2.11 2.15 2.13 4500400 2.13 -0.930232
2015-11-03 2.29 2.16 2.16 2.28 8705800 2.28 5.555550
2015-11-04 2.30 2.18 2.27 2.20 8205300 2.20 -3.083698
2015-11-05 2.24 2.17 2.21 2.20 4302200 2.20 -0.452488
2015-11-06 2.21 2.13 2.19 2.15 8997100 2.15 -1.826482
2015-11-09 2.18 2.10 2.15 2.11 6231200 2.11 -1.860474
2015-11-18 2.15 1.98 1.99 2.12 9384700 2.12 6.532657
2015-11-19 2.16 2.09 2.10 2.14 4704300 2.14 1.904772
2015-11-20 2.25 2.13 2.14 2.22 10727100 2.22 3.738314
2015-11-23 2.24 2.18 2.22 2.22 4863200 2.22 0.000000
2015-11-24 2.40 2.17 2.20 2.34 15859700 2.34 6.363630
2015-11-25 2.40 2.31 2.36 2.38 6914800 2.38 0.847467
2015-11-27 2.38 2.32 2.37 2.33 2606600 2.33 -1.687762
2015-11-30 2.37 2.25 2.34 2.36 9924400 2.36 0.854700
2015-12-01 2.37 2.31 2.36 2.34 5646400 2.34 -0.847457
2015-12-16 2.55 2.37 2.39 2.54 19543600 2.54 6.276144
2015-12-17 2.60 2.52 2.52 2.56 11374100 2.56 1.587300
2015-12-18 2.55 2.42 2.51 2.45 17988100 2.45 -2.390436
2015-12-21 2.53 2.43 2.47 2.53 6876600 2.53 2.429147
2015-12-22 2.78 2.54 2.55 2.77 24893200 2.77 8.627452
2015-12-23 2.94 2.75 2.76 2.83 30365300 2.83 2.536229
2015-12-24 3.00 2.86 2.88 2.92 11890900 2.92 1.388888
2015-12-28 3.02 2.86 2.91 3.00 16050500 3.00 3.092780
2015-12-29 3.06 2.97 3.04 3.00 15300900 3.00 -1.315788
2016-01-06 2.71 2.47 2.66 2.51 23759400 2.51 -5.639101
2016-01-07 2.48 2.26 2.43 2.28 22203500 2.28 -6.172843
2016-01-08 2.42 2.10 2.36 2.14 31822400 2.14 -9.322025
2016-01-11 2.36 2.12 2.16 2.34 19629300 2.34 8.333325
2016-01-12 2.46 2.28 2.40 2.39 17986100 2.39 -0.416666
2016-01-13 2.45 2.21 2.40 2.25 12749700 2.25 -6.250004
2016-01-14 2.35 2.21 2.29 2.21 15666600 2.21 -3.493447
2016-01-15 2.13 1.99 2.10 2.03 21199300 2.03 -3.333330
2016-01-19 2.11 1.90 2.08 1.95 18978900 1.95 -6.249994
2016-01-20 1.95 1.75 1.81 1.80 29243600 1.80 -0.552486
2016-01-21 2.18 1.81 1.82 2.09 26387900 2.09 14.835157
2016-01-22 2.17 1.98 2.11 2.02 16245500 2.02 -4.265399
Try to integrate your code with these modifications:
1) You probably don't need any loop to calculate the new column:
df['perchange'] = ((df['Close']-df['Open'])/df['Open'])*100
df['perchange'] = df['perchange'].astype(float)
2) Define an empty dataframe:
df1=pd.DataFrame([])
3) Filter the old df with the loc method (get used to its notation, it is very useful) and append the result to the empty dataframe; this will transfer the rows that satisfy the condition:
df1=df1.append(df.loc[(df['perchange'] <= -5.0) | (df['perchange'] >= 5.0)])
print(df1)
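Note that DataFrame.append has since been deprecated and was removed in pandas 2.0, so on a recent pandas the same selection can be written without it, for example:
# equivalent filter without append(); works on modern pandas
mask = (df['perchange'] <= -5.0) | (df['perchange'] >= 5.0)
df1 = df.loc[mask]
print(df1)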
Hope it helps.

How to group daily time series data into smaller dataframes of weeks

I have a dataframe that looks like this:
open high low close weekday
time
2011-11-29 2.55 2.98 2.54 2.75 1
2011-11-30 2.75 3.09 2.73 2.97 2
2011-12-01 2.97 3.14 2.93 3.06 3
2011-12-02 3.06 3.14 3.03 3.12 4
2011-12-03 3.12 3.13 2.75 2.79 5
2011-12-04 2.79 2.90 2.61 2.83 6
2011-12-05 2.83 2.93 2.78 2.88 0
2011-12-06 2.88 3.05 2.87 3.03 1
2011-12-07 3.03 3.08 2.93 2.99 2
2011-12-08 2.99 3.01 2.88 2.98 3
2011-12-09 2.98 3.04 2.93 2.97 4
2011-12-10 2.97 3.13 2.93 3.05 5
2011-12-11 3.05 3.38 2.99 3.25 6
The weekday column refers to 0 = Monday, ..., 6 = Sunday.
I want to make groups of smaller dataframes only containing the data for Friday, Saturday, Sunday and Monday. So one subset would look like this:
2011-12-02 3.06 3.14 3.03 3.12 4
2011-12-03 3.12 3.13 2.75 2.79 5
2011-12-04 2.79 2.90 2.61 2.83 6
2011-12-05 2.83 2.93 2.78 2.88 0
Filter before drop_duplicates:
df[df.weekday.isin([4,5,6,0])].drop_duplicates('weekday')
Out[10]:
open high low close weekday
2011-12-02 3.06 3.14 3.03 3.12 4
2011-12-03 3.12 3.13 2.75 2.79 5
2011-12-04 2.79 2.90 2.61 2.83 6
2011-12-05 2.83 2.93 2.78 2.88 0
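If you need every Friday-to-Monday window as its own dataframe, rather than just the first one, one possible sketch (my own extension, not part of the answer above) is to start a new group at each Friday:
# keep only Fri/Sat/Sun/Mon rows, then open a new group at every Friday;
# any leading Sat/Sun/Mon rows before the first Friday fall into group 0
subset = df[df['weekday'].isin([4, 5, 6, 0])]
group_id = (subset['weekday'] == 4).cumsum()
weekly_frames = [g for _, g in subset.groupby(group_id)]
print(weekly_frames[0])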

Work out the MAX of a Python Pandas dataframe within a date range

I have a set of stock market data, sampled below.
I would like to work out the MAX 'close' price over each 5-day period.
symbol date open high low close volume
AAU 1-Jan-07 2.25 2.25 2.25 2.25 0
AAU 2-Jan-07 2.25 2.25 2.25 2.25 0
AAU 3-Jan-07 2.32 2.32 2.26 2.26 39800
AAU 4-Jan-07 2.29 2.35 2.27 2.32 114200
AAU 5-Jan-07 2.32 2.32 2.26 2.27 113600
AAU 8-Jan-07 2.27 2.35 2.1 2.33 84500
AAU 9-Jan-07 2.31 2.31 2.21 2.23 54200
AAU 10-Jan-07 2.24 2.3 2.2 2.3 29000
AAU 11-Jan-07 2.23 2.33 2.22 2.24 21400
AAU 12-Jan-07 2.25 2.33 2.25 2.33 45200
To do this I have added a new column to calculate the end date of the range (+5 days):
df['5d_date'] = df['date'].shift(-5)
The df then looks like this:
symbol date open high low close volume 5d_date
AAU 1-Jan-07 2.25 2.25 2.25 2.25 0 8-Jan-07
AAU 2-Jan-07 2.25 2.25 2.25 2.25 0 9-Jan-07
AAU 3-Jan-07 2.32 2.32 2.26 2.26 39800 10-Jan-07
AAU 4-Jan-07 2.29 2.35 2.27 2.32 114200 11-Jan-07
AAU 5-Jan-07 2.32 2.32 2.26 2.27 113600 12-Jan-07
AAU 8-Jan-07 2.27 2.35 2.1 2.33 84500 15-Jan-07
AAU 9-Jan-07 2.31 2.31 2.21 2.23 54200 16-Jan-07
AAU 10-Jan-07 2.24 2.3 2.2 2.3 29000 17-Jan-07
AAU 11-Jan-07 2.23 2.33 2.22 2.24 21400 18-Jan-07
AAU 12-Jan-07 2.25 2.33 2.25 2.33 45200 19-Jan-07
Next I set the date column as the df Index:
df = df.set_index(['date'])
Then I attempt to loop through each row using the 'date' as the start date and the '5d_date' as the end date.
for i in df:
    date_filter = df.loc[df['date']:df['5d_date']]
    df['min_value'] = min(date_filter['low'])
    df['max_value'] = max(date_filter['high'])
Unfortunately I get a KeyError: 'date'.
I have tried many different ways, but cannot figure out how to do this. Does anyone know how to fix this, or a better way of doing it?
Thanks.
After you set the index to date (which, incidentally, is why df['date'] raises a KeyError: the column has become the index), you can use pd.DataFrame.rolling:
df.rolling('7d')['close'].mean()
Out[93]:
date
2007-01-01 2.250000
2007-01-02 2.250000
2007-01-03 2.253333
2007-01-04 2.270000
2007-01-05 2.270000
2007-01-08 2.286000
2007-01-09 2.282000
2007-01-10 2.290000
2007-01-11 2.274000
2007-01-12 2.286000
Name: close, dtype: float64
or, even without doing so,
df.rolling(5)['close'].mean()
Out[94]:
date
2007-01-01 NaN
2007-01-02 NaN
2007-01-03 NaN
2007-01-04 NaN
2007-01-05 2.270
2007-01-08 2.286
2007-01-09 2.282
2007-01-10 2.290
2007-01-11 2.274
2007-01-12 2.286
Name: close, dtype: float64
depending on whether you want a calendar week (the first snippet) or five rows of data (the second).
To have either of these at the start of the range instead of the end, just add .shift(-4) to the latter, and even to the former if you really do have exactly five days per week, every week.
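Since the question asks for the MAX 'close' rather than the mean, the same rolling windows should work with max() in place of mean():
# maximum close over each 7-calendar-day window, or each 5-row window
df.rolling('7d')['close'].max()
df.rolling(5)['close'].max()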
