I am having trouble with a complicated problem that I will try my best to describe.
I have a text file that has the following information
Customer Name: Zack Customer Number:12345
10.4 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order
10.5 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order
Customer Name: Larry Customer Number:00099
1.4 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order
1.5 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order
Customer Name: James Customer Number:99999
5.4 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order
5.5 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order
And so on, in the same format.
What I want to do is add the Customer Name value ("Zack") and the Customer Number ("12345") to the end of each of that customer's lines and get rid of the current format.
Eventually ending up with this new format.
10.4 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order Zack 12345
10.5 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order Zack 12345
1.4 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order Larry 00099
1.5 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order Larry 00099
5.4 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order James 99999
5.5 2014556 - FSV Poly -1.50 Feb 16 6 Each Unit Order James 99999
My thinking is something like: if a line starts with a number, then append the last seen Customer Name value and Customer Number value to the end of that line.
Is this even possible?
Thank you so much for your time!
Here is my code:
import re

file = open('Orders.txt', 'r')
for line in file:
    if line.__contains__('Customer Name') or line[0].isdigit():
        print(line.lstrip())
First of all, be careful when opening a file that way: you then have to call file.close() yourself!
What you can do is iterate through the file and update the customer name and number every time you find a "Customer Name" line.
Here is my take:
customer_number = 0
customer_name = ''
with open('Orders.txt', 'r') as file:  # Usually it's how files are opened
    while True:
        line = file.readline()
        if line == '':  # We get out of the loop when we're finished
            break
        if 'Customer Name' in line:
            # Change customer number and name
            # Quite hacky way to do it
            listline = line.split('Customer')
            customer_name = listline[1]
            customer_number = listline[2]
        else:
            print('%s, %s, %s' % (line.strip(), customer_name, customer_number))
Note that with this method the end of each line will read 'Name: yourcustomername' and 'Number: yourcustomernumber'.
You can quite easily extract only the name and the number from those, but I'll let you find that out yourself :)
I'd also recommend saving your changes to a file rather than simply printing them.
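If you want exactly the "Zack 12345" suffix shown in the desired output, here is a minimal sketch of that idea (the regex and the output filename Orders_out.txt are my own additions, not from the original post):

import re

customer_name = ''
customer_number = ''
with open('Orders.txt', 'r') as infile, open('Orders_out.txt', 'w') as outfile:
    for line in infile:
        # Pull the name and number out of header lines such as
        # "Customer Name: Zack Customer Number:12345"
        match = re.match(r'Customer Name:\s*(.+?)\s*Customer Number:\s*(\S+)', line)
        if match:
            customer_name, customer_number = match.group(1), match.group(2)
        elif line.strip():
            # Order lines: append the most recently seen name and number
            outfile.write('%s %s %s\n' % (line.strip(), customer_name, customer_number))

Each order line is written out unchanged, with the last customer name and number seen above it appended after a space.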
I worked on calculating churn using a mix of a pandas DataFrame of git logs and the git show command for a particular commit, to see exactly where the changes were made, based on LOC. However, I could not calculate churn based on age, i.e. count churn only when an engineer rewrites or deletes their own code that is less than 3 weeks old.
This is how I have done it with such a DataFrame, per commit:
git logs dataframe
sha timestamp date author message body age insertion deletion filepath churn merges
1 1 cae635054 Sat Jun 26 14:51:23 2021 -0400 2021-06-26 18:51:23+00:00 Andrew Clark `act`: Resolve to return value of scope function (#21759) When migrating some internal tests I found it annoying that I couldn't -24 days +12:21:32.839997
2 21 cae635054 Sat Jun 26 14:51:23 2021 -0400 2021-06-26 18:51:23+00:00 Andrew Clark `act`: Resolve to return value of scope function (#21759) When migrating some internal tests I found it annoying that I couldn't -24 days +12:21:32.839997 31.0 0.0 packages/react-reconciler/src/__tests__/ReactIsomorphicAct-test.js 31.0
3 22 cae635054 Sat Jun 26 14:51:23 2021 -0400 2021-06-26 18:51:23+00:00 Andrew Clark `act`: Resolve to return value of scope function (#21759) When migrating some internal tests I found it annoying that I couldn't -24 days +12:21:32.839997 1.0 1.0 packages/react-test-renderer/src/ReactTestRenderer.js 0.0
4 23 cae635054 Sat Jun 26 14:51:23 2021 -0400 2021-06-26 18:51:23+00:00 Andrew Clark `act`: Resolve to return value of scope function (#21759) When migrating some internal tests I found it annoying that I couldn't -24 days +12:21:32.839997 24.0 14.0 packages/react/src/ReactAct.js 10.0
5 25 e2453e200 Fri Jun 25 15:39:46 2021 -0400 2021-06-25 19:39:46+00:00 Andrew Clark act: Add test for bypassing queueMicrotask (#21743) Test for fix added in #21740 -25 days +13:09:55.839997 50.0 0.0 packages/react-reconciler/src/__tests__/ReactIsomorphicAct-test.js 50.0
6 27 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 4.0 5.0 packages/react-devtools-shared/src/__tests__/FastRefreshDevToolsIntegration-test.js -1.0
7 28 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 4.0 4.0 packages/react-devtools-shared/src/__tests__/componentStacks-test.js 0.0
8 29 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 12.0 12.0 packages/react-devtools-shared/src/__tests__/console-test.js 0.0
9 30 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 7.0 6.0 packages/react-devtools-shared/src/__tests__/editing-test.js 1.0
10 31 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 47.0 42.0 packages/react-devtools-shared/src/__tests__/inspectedElement-test.js 5.0
11 32 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 7.0 6.0 packages/react-devtools-shared/src/__tests__/ownersListContext-test.js 1.0
12 33 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 22.0 21.0 packages/react-devtools-shared/src/__tests__/profilerContext-test.js 1.0
churn calculation
commits = df["sha"].unique().tolist()
for commit in commits:
    contribution, churn = await self.calculate_churn(commit)
async def calculate_churn(self, stream):
    PREVIOUS_BASE_DIR = os.path.abspath("")
    try:
        GIT_DIR = os.path.join(PREVIOUS_BASE_DIR, "app/git/react.git")
        os.chdir(GIT_DIR)
    except FileNotFoundError as e:
        raise ValueError(e)
    cmd = f"git show --format= --unified=0 --no-prefix {stream}"
    cmds = [f"{cmd}"]
    results = get_proc_out(cmds)
    [files, contribution, churn] = get_loc(results)
    # need to circle back to previous path
    os.chdir(PREVIOUS_BASE_DIR)
    return contribution, churn
def is_new_file(result, file):
    # search for destination file (+++ ) and update file variable
    if result.startswith("+++"):
        return result[result.rfind(" ") + 1 :]
    else:
        return file
def is_loc_change(result, loc_changes):
    # search for loc changes (@@ ) and update loc_changes variable
    # @@ -1,5 +1,4 @@
    # @@ -l,s +l,s @@
    if result.startswith("@@"):
        # loc_change = result[2+1: ] -> -1,5 +1,4 @@
        loc_change = result[result.find(" ") + 1 :]
        # loc_change = loc_change[:9] -> -1,5 +1,4
        loc_change = loc_change[: loc_change.find(" @@")]
        return loc_change
    else:
        return loc_changes
def get_loc_change(loc_changes):
    # removals
    # "-1,5 +1,4" -> "-1,5"
    left = loc_changes[: loc_changes.find(" ")]
    left_dec = 0
    if left.find(",") > 0:
        # position of the comma, e.g. 2
        comma = left.find(",")
        # number of removed lines, e.g. 5
        left_dec = int(left[comma + 1 :])
        # starting line, e.g. 1
        left = int(left[1:comma])
    else:
        left = int(left[1:])
        left_dec = 1
    # additions
    # "+1,4"
    right = loc_changes[loc_changes.find(" ") + 1 :]
    right_dec = 0
    if right.find(",") > 0:
        comma = right.find(",")
        right_dec = int(right[comma + 1 :])
        right = int(right[1:comma])
    else:
        right = int(right[1:])
        right_dec = 1
    if left == right:
        return {left: (right_dec - left_dec)}
    else:
        return {left: left_dec, right: right_dec}
def get_loc(results):
    files = {}
    contribution = 0
    churn = 0
    file = ""
    loc_changes = ""
    for result in results:
        new_file = is_new_file(result, file)
        if file != new_file:
            file = new_file
            if file not in files:
                files[file] = {}
        else:
            new_loc_changes = is_loc_change(
                result, loc_changes
            )  # returns either empty or "-6 +6" or "-13,0 +14,2" format
            if loc_changes != new_loc_changes:
                loc_changes = new_loc_changes
                locc = get_loc_change(loc_changes)  # {2: 0} or {8: 0, 9: 1}
                for loc in locc:
                    # files[file] = {2: 0, 8: 0, 9: 1}
                    # print("loc", loc, files[file], locc[loc])
                    if loc in files[file]:
                        # change of lines triggered
                        files[file][loc] += locc[loc]
                        churn += abs(locc[loc])
                    else:
                        files[file][loc] = locc[loc]
                        contribution += abs(locc[loc])
            else:
                continue
    return [files, contribution, churn]
How can I use this same code but count churn only when the changes touch code that is less than 3 weeks old?
The only practical way to do this is to iterate through the DataFrame, and because that sucks with pandas, it almost always means you have the wrong data structure. If you're not doing numerical analysis, and it looks like you aren't, then just keep a simple list of dicts. Pandas has its shining points, but it's not a universal database.
Here's the rough code you'd need, although I'm glossing over details:
from datetime import timedelta

# Go through the df row by row.
lastdate = {}
for index, row in df.iterrows():
    if row['filepath'] in lastdate:
        if lastdate[row['filepath']] - row['date'] < timedelta(days=21):
            print("Last change to", row['filepath'], "was within three weeks")
    lastdate[row['filepath']] = row['date']
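Tying that back to the churn question: as a rough, hypothetical sketch (assuming the DataFrame above is available as df, that its date column holds real datetimes, and ignoring whether the same engineer made both changes), you could first flag the commits that rewrite a file changed less than three weeks earlier, and only run the existing churn calculation for those:

from datetime import timedelta

recent_rework_shas = set()   # commits that touch code younger than 3 weeks
last_change = {}             # filepath -> date of the previous change to that file
for _, row in df.sort_values("date").iterrows():   # oldest commit first
    path, date, sha = row["filepath"], row["date"], row["sha"]
    if path in last_change and date - last_change[path] < timedelta(days=21):
        recent_rework_shas.add(sha)
    last_change[path] = date

# Then reuse the unchanged calculate_churn() only for the flagged commits:
# for commit in recent_rework_shas:
#     contribution, churn = await self.calculate_churn(commit)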
I have a sample dataset which contains an id and costs in different years, like the one below:
Id  2015-04  2015-05  2015-06  2015-07  2016-04  2016-05  2016-06  2016-07  2017-04  2017-05  2017-06  2017-07  2018-04  2018-05  2018-06  2018-07
10    58500    58500    58300    57800    57500    57700    57800    57800    57800    57900    58400    59000    59500    59500    59000    58500
11   104600   104600   105700   106100   106300   107300   108000   107600   107800   108300   109200   109600   109300   108700   109000   110700
12   104900   106700   107900   107500   106100   105200   105700   106400   106700   107100   107200   107100   107500   108300   109200   110500
13    50500    49600    48900    48400    48100    48000    47700    47500    47400    47600    47800    47800    47600    47600    48100    48400
14    49800    49900    50300    50800    51100    51200    51200    51400    51600    51900    52400    52600    52300    51800    51100    50900
How can I create a function in Python to find the median cost per year for each id? I want the function to be dynamic in terms of the start and end year, so that if new data comes in for other years the code picks them up automatically. For example, if new data comes in for 2019, the end year would automatically be 2019 instead of 2018 and its median would be calculated as well.
With the current data sample given above, the result should look something like one below:
Id    2015    2016    2017    2018
10   58400   57750   58150   59250
11  105150  107450  108750  109150
12  107100  105900  107100  108750
13   49250   47850   47700   47850
14   50100   51200   52150   51450
First we split the column names on - and get only the year. Then we groupby over axis=1 based on these years and take the median:
df = df.set_index("Id")
df = df.groupby(df.columns.str.split("-").str[0], axis=1).median().reset_index()
# or get first 4 characters
# df = df.groupby(df.columns.str[:4], axis=1).median().reset_index()
Id 2015 2016 2017 2018
0 10 58400 57750 58150 59250
1 11 105150 107450 108750 109150
2 12 107100 105900 107100 108750
3 13 49250 47850 47700 47850
4 14 50100 51200 52150 51450
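If you want this as a reusable, dynamic function, here is a sketch (the name yearly_median is mine, and the second variant exists only because groupby(..., axis=1) is deprecated on newer pandas versions):

import pandas as pd

def yearly_median(df: pd.DataFrame) -> pd.DataFrame:
    """Median cost per Id for every year present in the column names."""
    costs = df.set_index("Id")
    # Group the month columns ("2015-04", ...) by their 4-character year prefix;
    # any new year that appears in the data is picked up automatically.
    return costs.groupby(costs.columns.str[:4], axis=1).median().reset_index()

def yearly_median_no_axis(df: pd.DataFrame) -> pd.DataFrame:
    """Same result via a transpose, for pandas versions where axis=1 groupby is deprecated."""
    costs = df.set_index("Id")
    return costs.T.groupby(costs.columns.str[:4].values).median().T.reset_index()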
I'm scraping a website that has a table of satellite values (https://planet4589.org/space/gcat/data/cat/satcat.html).
Because every entry is only separated by whitespace, I need a way to split the string of data entries into an array.
However, the .split() function does not suit my needs: because some of the data entries contain spaces (e.g. Able 3), I can't just split on every run of whitespace.
It gets trickier, however. In some cases where no data is available, a dash ("-") is used. If two data entries are separated by only a space, and one of them is a dash, I don't want to include it as one entry.
e.g. say we have the two entries "Able 3" and "-", separated only by a single space. In the file they would appear as "Able 3 -". I want to split this string into the separate data entries "Able 3" and "-" (as a list, ["Able 3", "-"]).
Another example would be the need to split "data1 -" into ["data1", "-"].
Pretty much, I need to take a string and split it into a list of words separated by whitespace, except that a single space between two words should not split them unless one of them is a dash.
Also, as you can see, the table is massive. I thought about looping through every character, but that would be too slow, and I need to run this thousands of times.
Here is a sample from the beginning of the file:
JCAT Satcat Piece Type Name PLName LDate Parent SDate Primary DDate Status Dest Owner State Manufacturer Bus Motor Mass DryMass TotMass Length Diamete Span Shape ODate Perigee Apogee Inc OpOrbitOQU AltNames
S00001 00001 1957 ALP 1 R2 8K71A M1-10 8K71A M1-10 (M1-1PS) 1957 Oct 4 - 1957 Oct 4 1933 Earth 1957 Dec 1 1000? R - OKB1 SU OKB1 Blok-A - 7790 7790 7800 ? 28.0 2.6 28.0 Cyl 1957 Oct 4 214 938 65.10 LLEO/I -
S00002 00002 1957 ALP 2 P 1-y ISZ PS-1 1957 Oct 4 S00001 1957 Oct 4 1933 Earth 1958 Jan 4? R - OKB1 SU OKB1 PS - 84 84 84 0.6 0.6 2.9 Sphere + Ant 1957 Oct 4 214 938 65.10 LLEO/I -
S00003 00003 1957 BET 1 P A 2-y ISZ PS-2 1957 Nov 3 A00002 1957 Nov 3 0235 Earth 1958 Apr 14 0200? AR - OKB1 SU OKB1 PS - 508 508 8308 ? 2.0 1.0 2.0 Cone 1957 Nov 3 211 1659 65.33 LEO/I -
S00004 00004 1958 ALP P A Explorer 1 Explorer 1 1958 Feb 1 A00004 1958 Feb 1 0355 Earth 1970 Mar 31 1045? AR - ABMA/JPL US JPL Explorer - 8 8 14 0.8 0.1 0.8 Cyl 1958 Feb 1 359 2542 33.18 LEO/I -
S00005 00005 1958 BET 2 P Vanguard I Vanguard Test Satellite 1958 Mar 17 S00016 1958 Mar 17 1224 Earth - O - NRL US NRL NRL 6" - 2 2 2 0.1 0.1 0.1 Sphere 1959 May 23 657 3935 34.25 MEO -
S00006 00006 1958 GAM P A Explorer 3 Explorer 3 1958 Mar 26 A00005 1958 Mar 26 1745 Earth 1958 Jun 28 AR - ABMA/JPL US JPL Explorer - 8 8 14 0.8 0.1 0.8 Cyl 1958 Mar 26 195 2810 33.38 LEO/I -
S00007 00007 1958 DEL 1 R2 8K74A 8K74A 1958 May 15 - 1958 May 15 0705 Earth 1958 Dec 3 R - OKB1 SU OKB1 Blok-A - 7790 7790 7820 ? 28.0 2.6 28.0 Cyl 1958 May 15 214 1860 65.18 LEO/I -
S00008 00008 1958 DEL 2 P 3-y Sovetskiy ISZ D-1 No. 2 1958 May 15 S00007 1958 May 15 0706 Earth 1960 Apr 6 R - OKB1 SU OKB1 Object D - 1327 1327 1327 3.6 1.7 3.6 Cone 1959 May 7 207 1247 65.12 LEO/I -
S00009 00009 1958 EPS P A Explorer 4 Explorer 4 1958 Jul 26 A00009 1958 Jul 26 1507 Earth 1959 Oct 23 AR - ABMA/JPL US JPL Explorer - 12 12 17 0.8 0.1 0.8 Cyl 1959 Apr 24 258 2233 50.40 LEO/I -
S00010 00010 1958 ZET P A SCORE SCORE 1958 Dec 18 A00015 1958 Dec 18 2306 Earth 1959 Jan 21 AR - ARPA/SRDL US SRDL SCORE - 68 68 3718 2.5 ? 1.5 ? 2.5 Cone 1958 Dec 30 159 1187 32.29 LEO/I -
S00011 00011 1959 ALP 1 P Vanguard II Cloud cover satellite 1959 Feb 17 S00012 1959 Feb 17 1605 Earth - O - BSC US NRL NRL 20" - 10 10 10 0.5 0.5 0.5 Sphere 1959 May 15 564 3304 32.88 MEO -
S00012 00012 1959 ALP 2 R3 GRC 33-KS-2800 GRC 33-KS-2800 175-15-21 1959 Feb 17 R02749 1959 Feb 17 1604 Earth - O - BSC US GCR 33-KS-2800 - 195 22 22 1.5 0.7 1.5 Cyl 1959 Apr 28 564 3679 32.88 MEO -
S00013 00013 1959 BET P A Discoverer 1 CORONA Test Vehicle 2 1959 Feb 28 A00017 1959 Feb 28 2156 Earth 1959 Mar 5 AR - ARPA/CIA US LMSD CORONA - 78 ? 78 ? 668 ? 2.0 1.5 2.0 Cone 1959 Feb 28 163? 968? 89.70 LLEO/P -
S00014 00014 1959 GAM P A Discoverer 2 CORONA BIO 1 1959 Apr 13 A00021 1959 Apr 13 2126 Earth 1959 Apr 26 AR - ARPA/CIA US LMSD CORONA - 110 ? 110 ? 788 1.3 1.5 1.3 Frust 1959 Apr 13 239 346 89.90 LLEO/P -
S00015 00015 1959 DEL 1 P Explorer 6 NASA S-2 1959 Aug 7 S00017 1959 Aug 7 1430 Earth 1961 Jul 1 R? - GSFC US TRW Able Probe ARC 420 40 40 42 ? 0.7 0.7 2.2 Sphere + 4 Pan 1959 Sep 8 250 42327 46.95 HEO - Able 3
S00016 00016 1958 BET 1 R3 GRC 33-KS-2800 GRC 33-KS-2800 144-79-22 1958 Mar 17 R02064 1958 Mar 17 1223 Earth - O - NRL US GCR 33-KS-2800 - 195 22 22 1.5 0.7 1.5 Cyl 1959 Sep 30 653 4324 34.28 MEO -
S00017 00017 1959 DEL 2 R3 Altair Altair X-248 1959 Aug 7 A00024 1959 Aug 7 1428 Earth 1961 Jun 30 R? - USAF US ABL Altair - 24 24 24 1.5 0.5 1.5 Cyl 1961 Jan 8 197 40214 47.10 GTO -
S00018 00018 1959 EPS 1 P A Discoverer 5 CORONA C-2 1959 Aug 13 A00028 1959 Aug 13 1906 Earth 1959 Sep 28 AR - ARPA/CIA US LMSD CORONA - 140 140 730 1.3 1.5 1.3 Frust 1959 Aug 14 215 732 80.00 LLEO/I - NRO Mission 9002
A less haphazard approach would be to interpret the headers on the first line as column indicators, and split on those widths.
import sys
import re

def col_widths(s):
    # Shamelessly adapted from https://stackoverflow.com/a/33090071/874188
    cols = re.findall(r'\S+\s+', s)
    return [len(col) for col in cols]

widths = col_widths(next(sys.stdin))

for line in sys.stdin:
    line = line.rstrip('\n')
    fields = []
    for col_max in widths[:-1]:
        fields.append(line[0:col_max].strip())
        line = line[col_max:]
    fields.append(line)
    print(fields)
Demo: https://ideone.com/ASANjn
This seems to provide a better interpretation of, e.g., the LDate column, where the dates are sometimes padded with more than one space. The penultimate column preserves the final dash as part of the column value; this seems consistent with the apparent intent of the author of the original table, but you can split that dash off from that specific column separately if it's not to your liking.
If you don't want to read sys.stdin, just wrap this in with open(filename) as handle: and replace sys.stdin with handle everywhere.
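For example, a file-based variant of the same loop might look like this (a sketch; satcat.txt is just a placeholder filename, and col_widths() is the helper defined above):

with open("satcat.txt") as handle:
    widths = col_widths(next(handle))   # the header line defines the column widths
    for line in handle:
        line = line.rstrip('\n')
        fields = []
        for col_max in widths[:-1]:
            fields.append(line[:col_max].strip())
            line = line[col_max:]
        fields.append(line)             # the last column takes whatever is left
        print(fields)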
One approach is to use pandas.read_fwf(), which reads text files in fixed-width format. The function returns Pandas DataFrames, which are useful for handling large data sets.
As a quick taste, here's what this simple bit of code does:
import pandas as pd
data = pd.read_fwf("data.txt")
print(data.columns) # Prints an index of all columns.
print()
print(data.head(5)) # Prints the top 5 rows.
# Index(['JCAT', 'Satcat', 'Piece', 'Type', 'Name', 'PLName', 'LDate',
# 'Unnamed: 7', 'Parent', 'SDate', 'Unnamed: 10', 'Unnamed: 11',
# 'Primary', 'DDate', 'Unnamed: 14', 'Status', 'Dest', 'Owner', 'State',
# 'Manufacturer', 'Bus', 'Motor', 'Mass', 'Unnamed: 23', 'DryMass',
# 'Unnamed: 25', 'TotMass', 'Unnamed: 27', 'Length', 'Unnamed: 29',
# 'Diamete', 'Span', 'Unnamed: 32', 'Shape', 'ODate', 'Unnamed: 35',
# 'Perigee', 'Apogee', 'Inc', 'OpOrbitOQU', 'AltNames'],
# dtype='object')
#
# JCAT Satcat Piece Type ... Apogee Inc OpOrbitOQU AltNames
# 0 S00001 1 1957 ALP 1 R2 ... 938 65.10 LLEO/I - NaN
# 1 S00002 2 1957 ALP 2 P ... 938 65.10 LLEO/I - NaN
# 2 S00003 3 1957 BET 1 P A ... 1659 65.33 LEO/I - NaN
# 3 S00004 4 1958 ALP P A ... 2542 33.18 LEO/I - NaN
# 4 S00005 5 1958 BET 2 P ... 3935 34.25 MEO - NaN
You'll note that some of the columns are unnamed. We can solve this by determining the field widths ourselves and using them to guide read_fwf()'s parsing. We'll do this by reading the first line of the file and iterating over it.
# Read the header line whose spacing defines the columns.
with open("data.txt") as f:
    first_line = f.readline().rstrip("\n")

field_widths = []  # We'll append column widths into this list.
last_i = 0
new_field = False
for i, x in enumerate(first_line):
    if x != ' ' and new_field:
        # Register a new field.
        new_field = False
        field_widths.append(i - last_i)  # Get the field width by subtracting
                                         # the index from previous field's index.
        last_i = i  # Set the new field index.
    elif not new_field and x == ' ':
        # We've encountered a space.
        new_field = True  # Set true so that the next non-space encountered
                          # is recognised as a new field.
else:
    field_widths.append(64)  # Append last field. Set to a high number,
                             # so that all strings are eventually read.
Just a simple for-loop. Nothing fancy.
All that's left is passing the field_widths list through the widths= keyword arg:
data = pd.read_fwf("data.txt", widths=field_widths)
print(data.columns)
# Index(['JCAT', 'Satcat', 'Piece', 'Type', 'Name', 'PLName', 'LDate', 'Parent',
# 'SDate', 'Primary', 'DDate', 'Status', 'Dest', 'Owner', 'State',
# 'Manufacturer', 'Bus', 'Motor', 'Mass', 'DryMass', 'TotMass', 'Length',
# 'Diamete', 'Span', 'Shape', 'ODate', 'Perigee', 'Apogee', 'Inc',
# 'OpOrbitOQU'],
# dtype='object')
data is a dataframe, but with some work, you can change it to a list of lists or a list of dicts. Or you could also work with the dataframe directly.
So say, you want the first row. Then you could do
datalist = data.values.tolist()
print(datalist[0])
# ['S00001', 1, '1957 ALP 1', 'R2', '8K71A M1-10', '8K71A M1-10 (M1-1PS)', '1957 Oct 4', '-', '1957 Oct 4 1933', 'Earth', '1957 Dec 1 1000?', 'R', '-', 'OKB1', 'SU', 'OKB1', 'Blok-A', '-', '7790', '7790', '7800 ?', '28.0', '2.6', '28.0', 'Cyl', '1957 Oct 4', '214', '938', '65.10', 'LLEO/I -']
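And if you do want plain Python structures rather than a DataFrame, converting to a list of dicts keyed by column name is one line (a sketch reusing the data frame built above):

records = data.to_dict(orient="records")   # one dict per row, keyed by column name
print(records[0]["Name"], records[0]["Perigee"])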
I have a text file that is the output of a command I ran with Netmiko to retrieve data from a Cisco WLC about things that are causing interference on our WiFi network. I stripped out just what I needed from the original 600k lines of output, down to a couple thousand lines like this:
AP Name.......................................... 010-HIGH-FL4-AP04
Microwave Oven 11 10 -59 Mon Dec 18 08:21:23 2017
WiMax Mobile 11 0 -84 Fri Dec 15 17:09:45 2017
WiMax Fixed 11 0 -68 Tue Dec 12 09:29:30 2017
AP Name.......................................... 010-2nd-AP04
Microwave Oven 11 10 -61 Sat Dec 16 11:20:36 2017
WiMax Fixed 11 0 -78 Mon Dec 11 12:33:10 2017
AP Name.......................................... 139-FL1-AP03
Microwave Oven 6 18 -51 Fri Dec 15 12:26:56 2017
AP Name.......................................... 010-HIGH-FL3-AP04
Microwave Oven 11 10 -55 Mon Dec 18 07:51:23 2017
WiMax Mobile 11 0 -83 Wed Dec 13 16:16:26 2017
The goal is to end up with a CSV file that strips out the 'AP Name ...' part and puts what's left on the same line as the rest of the information from the following line(s). The problem is that some AP names have two lines below them and some have one or none. I have been at it for 8 hours and cannot find the best way to make this happen.
This is the latest version of code that I was trying to use; any suggestions for making this work? I just want something I can load up in Excel and create a report with:
with open(outfile_name, 'w') as out_file:
    with open('wlc-interference_raw.txt', 'r') as in_file:
        #Variables
        _ap_name = ''
        _temp = ''
        _flag = False
        for i in in_file:
            if 'AP Name' in i:
                #write whatever was put in the temp variable to disk because new ap now
                #add another temp variable in case an ap has more than 1 interferer and check if new AP name
                out_file.write(_temp)
                out_file.write('\n')
                #print(_temp)
                _ap_name = i.lstrip('AP Name.......................................... ')
                _ap_name = _ap_name.rstrip('\n')
                _temp = _ap_name
                #print(_temp)
            elif '----' in i:
                pass
            elif 'Class Type' in i:
                pass
            else:
                line_split = i.split()
                for x in line_split:
                    _temp += ','
                    _temp += x
                _temp += '\n'
I think your best option is to read all lines of the file, then split into sections starting with AP Name. Then you can work on parsing each section.
Example
s = """AP Name.......................................... 010-HIGH-FL4-AP04
Microwave Oven 11 10 -59 Mon Dec 18 08:21:23 2017
WiMax Mobile 11 0 -84 Fri Dec 15 17:09:45 2017
WiMax Fixed 11 0 -68 Tue Dec 12 09:29:30 2017
AP Name.......................................... 010-2nd-AP04
Microwave Oven 11 10 -61 Sat Dec 16 11:20:36 2017
WiMax Fixed 11 0 -78 Mon Dec 11 12:33:10 2017
AP Name.......................................... 139-FL1-AP03
Microwave Oven 6 18 -51 Fri Dec 15 12:26:56 2017
AP Name.......................................... 010-HIGH-FL3-AP04
Microwave Oven 11 10 -55 Mon Dec 18 07:51:23 2017
WiMax Mobile 11 0 -83 Wed Dec 13 16:16:26 2017"""
import re

class AP:
    """
    A class holding each section of the parsed file
    """
    def __init__(self):
        self.header = ""
        self.content = []

sections = []
section = None
for line in s.split('\n'):  # Or 'for line in file:'
    # Starting new section
    if line.startswith('AP Name'):
        # If previously had a section, add to list
        if section is not None:
            sections.append(section)
        section = AP()
        section.header = line
    else:
        if section is not None:
            section.content.append(line)
sections.append(section)  # Add last section outside of loop

for section in sections:
    ap_name = section.header.lstrip("AP Name.")  # lstrip takes all the characters given, not a literal string
    for line in section.content:
        print(ap_name + ",", end="")
        # You can extract the date separately, if needed
        # Splitting on more than one space using a regex
        line = ",".join(re.split(r'\s\s+', line))
        print(line.rstrip(','))  # Remove trailing comma from imperfect split
Output
010-HIGH-FL4-AP04,Microwave Oven,11,10,-59,Mon Dec 18 08:21:23 2017
010-HIGH-FL4-AP04,WiMax Mobile,11,0,-84,Fri Dec 15 17:09:45 2017
010-HIGH-FL4-AP04,WiMax Fixed,11,0,-68,Tue Dec 12 09:29:30 2017
010-2nd-AP04,Microwave Oven,11,10,-61,Sat Dec 16 11:20:36 2017
010-2nd-AP04,WiMax Fixed,11,0,-78,Mon Dec 11 12:33:10 2017
139-FL1-AP03,Microwave Oven,6,18,-51,Fri Dec 15 12:26:56 2017
010-HIGH-FL3-AP04,Microwave Oven,11,10,-55,Mon Dec 18 07:51:23 2017
010-HIGH-FL3-AP04,WiMax Mobile,11,0,-83,Wed Dec 13 16:16:26 2017
Tip:
You don't need Python to write the CSV file itself; you can redirect the script's output to a file from the command line:
python script.py > output.csv
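If you would rather have the script write the file itself (for example to get correct quoting should a field ever contain a comma), a small sketch using the standard csv module could replace the print calls; the filename output.csv and the reuse of sections from the code above are assumptions:

import csv
import re

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for section in sections:
        ap_name = section.header.lstrip("AP Name.")
        for line in section.content:
            # Same idea as above: two or more spaces separate the fields.
            writer.writerow([ap_name] + [x for x in re.split(r'\s\s+', line) if x])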
I'm trying to figure out how to implement an automatic backup file naming/recycling strategy that keeps older backup files, but with decreasing frequency over time. The basic idea is that at most one file would be removed whenever a new one is added, but I was not successful implementing this from scratch.
That's why I started to try out the Grandfather-Father-Son pattern, though there is no requirement to stick to it. I started my experiments using a single pool, but I failed more than once, so I started again from this more descriptive approach using four pools, one for each frequency:[1]
import datetime

t = datetime.datetime(2001, 1, 1, 5, 0, 0)  # start at 1st of Jan 2001, at 5:00 am
d = datetime.timedelta(days=1)

days = []
weeks = []
months = []
years = []

def pool_it(t):
    days.append(t)
    if len(days) > 7:  # keep not more than seven daily backups
        del days[0]
    if (t.weekday() == 6):
        weeks.append(t)
        if len(weeks) > 5:  # ...not more than 5 weekly backups
            del weeks[0]
    if (t.day == 28):
        months.append(t)
        if len(months) > 12:  # ... limit monthly backups
            del months[0]
    if (t.day == 28 and t.month == 12):
        years.append(t)
        if len(years) > 10:  # ... limit yearly backups...
            del years[0]

for i in range(4505):
    pool_it(t)
    t += d
no = 0

def print_pool(pool, rt):
    global no
    print("----")
    for i in pool:
        no += 1
        print("{:3} {} {}".format(no, i.strftime("%Y-%m-%d %a"), (i-rt).days))

print_pool(years, t)
print_pool(months, t)
print_pool(weeks, t)
print_pool(days, t)
The output shows that there are duplicates, marked with * and **
----
1 2003-12-28 Sun -3414
2 2004-12-28 Tue -3048
3 2005-12-28 Wed -2683
4 2006-12-28 Thu -2318
5 2007-12-28 Fri -1953
6 2008-12-28 Sun -1587
7 2009-12-28 Mon -1222
8 2010-12-28 Tue -857
9 2011-12-28 Wed -492
10 2012-12-28 Fri -126 *
----
11 2012-05-28 Mon -340
12 2012-06-28 Thu -309
13 2012-07-28 Sat -279
14 2012-08-28 Tue -248
15 2012-09-28 Fri -217
16 2012-10-28 Sun -187
17 2012-11-28 Wed -156
18 2012-12-28 Fri -126 *
19 2013-01-28 Mon -95
20 2013-02-28 Thu -64
21 2013-03-28 Thu -36
22 2013-04-28 Sun -5 **
----
23 2013-03-31 Sun -33
24 2013-04-07 Sun -26
25 2013-04-14 Sun -19
26 2013-04-21 Sun -12
27 2013-04-28 Sun -5 **
----
28 2013-04-26 Fri -7
29 2013-04-27 Sat -6
30 2013-04-28 Sun -5 **
31 2013-04-29 Mon -4
32 2013-04-30 Tue -3
33 2013-05-01 Wed -2
34 2013-05-02 Thu -1
...which is not a big problem. What I'm getting from it is daily backups for the last week, weekly backups for the last month, monthly backups for the last year, and yearly backups for 10 years. The number of files is always limited to 10+12+5+7=34.
My ideal solution would
create files with human-readable names including timestamps (i.e. xyz-yyyy-mm-dd.bak)
use only one pool (store/remove files within one folder)
recycle in a targeted way, that is, never delete more than one file a day
(naturally) not contain any duplicates
Do you have a trivial solution at hand or a suggestion where to learn more about it?
[1] I used Python to better understand/communicate my question, but the question is really about the algorithm.
As a committer of pyExpireBackups I can point you to the ExpirationRule implementation of my solution (source below and in the GitHub repo).
See https://wiki.bitplan.com/index.php/PyExpireBackups for the documentation.
An example run would lead to:
keeping 7 files for dayly backup
keeping 6 files for weekly backup
keeping 8 files for monthly backup
keeping 4 files for yearly backup
expiring 269 files dry run
# 1✅: 0.0 days( 5 GB/ 5 GB)→./sql_backup.2022-04-02.tgz
# 2✅: 3.0 days( 5 GB/ 9 GB)→./sql_backup.2022-03-30.tgz
# 3✅: 4.0 days( 5 GB/ 14 GB)→./sql_backup.2022-03-29.tgz
# 4✅: 5.0 days( 5 GB/ 18 GB)→./sql_backup.2022-03-28.tgz
# 5✅: 7.0 days( 5 GB/ 23 GB)→./sql_backup.2022-03-26.tgz
# 6✅: 9.0 days( 5 GB/ 27 GB)→./sql_backup.2022-03-24.tgz
# 7✅: 11.0 days( 5 GB/ 32 GB)→./sql_backup.2022-03-22.tgz
# 8❌: 15.0 days( 5 GB/ 37 GB)→./sql_backup.2022-03-18.tgz
# 9❌: 17.0 days( 5 GB/ 41 GB)→./sql_backup.2022-03-16.tgz
# 10✅: 18.0 days( 5 GB/ 46 GB)→./sql_backup.2022-03-15.tgz
# 11❌: 19.0 days( 5 GB/ 50 GB)→./sql_backup.2022-03-14.tgz
# 12❌: 20.0 days( 5 GB/ 55 GB)→./sql_backup.2022-03-13.tgz
# 13❌: 22.0 days( 5 GB/ 59 GB)→./sql_backup.2022-03-11.tgz
# 14❌: 23.0 days( 5 GB/ 64 GB)→./sql_backup.2022-03-10.tgz
# 15✅: 35.0 days( 4 GB/ 68 GB)→./sql_backup.2022-02-26.tgz
# 16❌: 37.0 days( 4 GB/ 73 GB)→./sql_backup.2022-02-24.tgz
# 17❌: 39.0 days( 4 GB/ 77 GB)→./sql_backup.2022-02-22.tgz
# 18❌: 40.0 days( 5 GB/ 82 GB)→./sql_backup.2022-02-21.tgz
# 19✅: 43.0 days( 4 GB/ 86 GB)→./sql_backup.2022-02-18.tgz
...
class ExpirationRule():
    '''
    an expiration rule keeps files at a certain frequency
    '''
    def __init__(self, name, freq: float, minAmount: int):
        '''
        constructor

        name(str): name of this rule
        freq(float): the frequency in days
        minAmount(int): the minimum number of files to keep around
        '''
        self.name = name
        self.ruleName = name  # will later be changed by a sideEffect in getNextRule e.g. from "week" to "weekly"
        self.freq = freq
        self.minAmount = minAmount
        if minAmount < 0:
            raise Exception(f"{self.minAmount} {self.name} is invalid - {self.name} must be >=0")

    def reset(self, prevFile: BackupFile):
        '''
        reset my state with the given previous file

        Args:
            prevFile(BackupFile): the file to anchor my startAge with
        '''
        self.kept = 0
        if prevFile is None:
            self.startAge = 0
        else:
            self.startAge = prevFile.ageInDays

    def apply(self, file: BackupFile, prevFile: BackupFile, debug: bool) -> bool:
        '''
        apply me to the given file taking the previously kept file prevFile (which might be None) into account

        Args:
            file(BackupFile): the file to apply this rule for
            prevFile(BackupFile): the previous file to potentially take into account
            debug(bool): if True show debug output
        '''
        if prevFile is not None:
            ageDiff = file.ageInDays - prevFile.ageInDays
            keep = ageDiff >= self.freq
        else:
            ageDiff = file.ageInDays - self.startAge
            keep = True
        if keep:
            self.kept += 1
        else:
            file.expire = True
        if debug:
            print(f"Δ {ageDiff}({ageDiff-self.freq}) days for {self.ruleName}({self.freq}) {self.kept}/{self.minAmount}{file}")
        return self.kept >= self.minAmount
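To connect this back to the single-pool requirement from the question: below is a minimal, hypothetical sketch, independent of pyExpireBackups, of deciding which timestamped files (named like xyz-yyyy-mm-dd.bak) in one folder to keep. The function name files_to_keep and the limits 7/5/12/10 are my own choices.

import datetime
import re
from pathlib import Path

def files_to_keep(folder: str) -> set:
    '''Return the set of backup files in folder that a GFS-style rule set would keep.'''
    dated = []
    for p in Path(folder).glob("*.bak"):
        m = re.search(r"(\d{4})-(\d{2})-(\d{2})", p.name)
        if m:
            dated.append((datetime.date(*map(int, m.groups())), p))
    dated.sort(reverse=True)  # newest first
    keep = set()
    # example rules: (bucket key derived from the file date, how many buckets to keep)
    rules = [
        (lambda d: d.toordinal(), 7),                                   # last 7 days with backups
        (lambda d: d - datetime.timedelta(days=d.weekday()), 5),        # last 5 weeks (keyed by Monday)
        (lambda d: (d.year, d.month), 12),                              # last 12 months
        (lambda d: d.year, 10),                                         # last 10 years
    ]
    for bucket_of, limit in rules:
        seen = []
        for d, p in dated:  # newest first: the newest file of each bucket wins
            b = bucket_of(d)
            if b not in seen:
                seen.append(b)
                if len(seen) <= limit:
                    keep.add(p)
    return keep

Everything not in the returned set is a removal candidate; to satisfy the "delete at most one file a day" requirement you would remove only one such candidate (for example the oldest) each time a new backup is added.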