how to read .SDL text file? - python

I have a .SDL text file in the following format:
244455|199|6577888|20210401|138.61|0.78|83.16|0.00|0.00|221.77|6|0.00|17000
Is there any Python library to read and interpret such a .SDL text file?

I am assuming that the file contains only a single line.
data.sdl:
490797|C|64||BLAH BLAH BLAH||||0|190/0000/07|A|1998889|198666566|||8990900|BLAGHHH72|L78899|||0040|012|432565|012|435659||MBLAHAHAHAHASIE|2WES|ARGHKKHHHT|PRE||0002|012|432565|012|435659||MR. JOHN DOE|PO BOX 198898|SILUHHHHH||0052|661|13||82110|35000000|2|0|||||0|0||||Y||70877746414|R
Python script to extract data in a list:
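data_list = []
# with open('path/to/file.sdl') as file
with open('data.sdl', 'r') as file:
    data = file.read()
    # split the single line on the pipe delimiter
    data_list = data.split('|')
    # remove the trailing newline from the last field
    data_list[-1] = data_list[-1].strip()
    # drop empty fields (note: this shifts the positions of later fields)
    data_list = list(filter(None, data_list))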
Output:
['490797', 'C', '64', 'BLAH BLAH BLAH', '0', '190/0000/07', 'A', '1998889', '198666566', '8990900', 'BLAGHHH72', 'L78899', '0040', '012', '432565', '012', '435659', 'MBLAHAHAHAHASIE', '2WES', 'ARGHKKHHHT', 'PRE', '0002', '012', '432565', '012', '435659', 'MR. JOHN DOE', 'PO BOX 198898', 'SILUHHHHH', '0052', '661', '13', '82110', '35000000', '2', '0', '0', '0', 'Y', '70877746414', 'R']
Please let me know if you need anything else.

Presuming there are more rows than you've provided, all in the same format, pandas' read_csv() will be able to load this for you. Pass header=None, since the sample has no header row and the first data row would otherwise be used as the column names.
import pandas as pd
df = pd.read_csv("my_path/whateverfilename.sdl", sep="|", header=None)
This will create a DataFrame object for you, which may be what you're after.
If you just want each row as a list, you can simply open the file and .split() each line, though this will probably be harder to work with overall:
split_lines = []
with open("my_path/whateverfilename.sdl") as fh:
    for line in fh:  # file-like objects are iterable by-line
        # strip the line ending so it does not end up in the last field
        split_lines.append(line.rstrip("\n").split("|"))
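If you would rather stay within the standard library, csv.reader accepts a delimiter argument. The following is a minimal sketch, assuming the file has no header row and no quoted fields; the sample text is built from lines in the question, and `sample` and `rows` are illustrative names. Unlike the filter(None, ...) approach, empty fields are kept, so every value stays at its original position.

```python
import csv
import io

# Two sample lines in the question's format; in practice, open the .sdl file instead.
sample = (
    "244455|199|6577888|20210401|138.61|0.78|83.16|0.00|0.00|221.77|6|0.00|17000\n"
    "490797|C|64||BLAH BLAH BLAH\n"
)

# csv.reader splits each line on the delimiter and strips the line ending.
rows = list(csv.reader(io.StringIO(sample), delimiter="|"))

print(len(rows[0]))  # number of fields in the first line
print(rows[1])       # the empty field between '64' and 'BLAH BLAH BLAH' is preserved
```

For a real file, replace io.StringIO(sample) with a file opened using open(path, newline="").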

Assuming that each line has the same number of columns:
File './path_to_data':
244455|199|6577888|20210401|138.61|0.78|83.16|0.00|0.00|221.77|6|0.00|17000
||||0||0|| , |C|64||
Data "reader":
import numpy as np
path = './path_to_data'
N_COLS = 13
# declare the data type of each column - in this case python Object
dts = np.dtype(', '.join(['O'] * N_COLS))
data = np.loadtxt(fname=path, delimiter='|', dtype=dts, unpack=False, skiprows=0, max_rows=None)
for i in data:
    print(i)
Output
('244455', '199', '6577888', '20210401', '138.61', '0.78', '83.16', '0.00', '0.00', '221.77', '6', '0.00', '17000')
('', '', '', '', '0', '', '0', '', ' , ', 'C', '64', '', '')
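To get the data by column, set unpack=True.
To choose which line to start reading from, set skiprows (0 starts at the first line).
To stop after a given number of lines, set max_rows; None (the default) reads everything.
See the numpy.loadtxt documentation for details.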

Related

Unable to create a pandas dataframe from a json list due to presence of a colon in one of the values

The code below works fine with other list objects, but here due to the presence of a colon in imageURL, it's giving me an error. I have to load the data dynamically without looking at the particular key value pair. Please help.
dt=[{'lineno': '3544', 'sku': 'B2039P015DP', 'status': 'Shipped', 'order_qty': '4', 'openQty': '0', 'wipQty': '0', 'shippedQty': '2', 'closedQty': '0', 'closed_date': '', 'returnedQty': '0', 'deliveredQty': '0', 'imageUrl': 'https://d2p3w.cloudfront.net/pub/media/catalog/product/b/2/b2039p010ds.jpg', 'itemName': 'Primo Brown Cube Box, 5Ply, (20"x10"x10"), Pack of 15', 'price': '1033.76000', 'udf1': None, 'udf2': None, 'udf3': None, 'udf4': None, 'udf5': None, 'internalLineNo': '1'}]
dummy = pd.read_json(json.dumps(dt),orient='records')
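Use json.loads to parse the string and pass the result to pd.DataFrame, rather than using pd.read_json.
With your input dt, this code works fine:
dummy = pd.DataFrame(json.loads(json.dumps(dt)))
Since dt is already a list of dicts, pd.DataFrame(dt) gives the same result without the JSON round trip.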

Python: Reading a Dataframe

I'm trying to read through a dataframe row by row, and grab what I want out of each row by index. The CSV I'm reading in looks something like this...
But when I read it in and run this code on it...
def sendKafkaMessagesTest(df):
    df.columns = ['Platform_Name', 'Index', 'Type', 'Weapon', 'Munitions', 'Location', 'Tracks', 'Time']
    for ind in df.index:
        data = {'platform_name': str(df['Platform_Name'][ind]),
                'tracks': str(df['Tracks'][ind]), 'time': str(df['Time'][ind])}
        print(data)
        producer.send('numtest', data)
It produces this... {'platform_name': '540', 'tracks': '0', 'time': 'nan'}
I tried changing the columns which I thought would work, but still a no go. It's like it's not considering Row A to be part of the data or something. Any ideas?
EDIT: Reading CSV file as df = pd.read_csv(event.src_path)
EDIT: Expected output is {'platform_name': 'TSC2_commander', 'tracks': '0', 'time': '0'}

Get numerical values from document in python

I've extracted the details of a docx file using this code
from docx import Document
document = Document('136441742-Rental-Agreement-Format.pdf.docx')
for para in document.paragraphs:
    print(para.text)
The output contains numerical values, dates and text fields. How can I extract the numerical values and dates?
Using the document output you shared in the comment as a string, and assuming the date format is dd.mm.yyyy and does not change, I wrote the code below to get the date and numerical values, and it works fine for me.
I am using a regular expression to extract the date and isdigit() to get the numerical values.
You could adapt the code below to work on your exact document output if needed.
import re
from datetime import datetime
text = "TENANCY AGREEMENT This Tenancy Agreement is made and executed at Bangalore on this 22.01.2013 by MR .P .RAJA SEKHAR AGED ABOUT 28 YRS S/0.MR.KRISHNA PARAMATMA PENTAKOTA R/at NESTER RAGA B-502, OPP MORE MEGA STORE BANGALORE-560 048 Hereinafter called the 'OWNER' of the One Part. AND MR.VENKATA BHYRAVA MURTHY MUTNURI & P/at NO.17-2-16, l/EERABHARAPURAM AGED ABOUT 26 YRS RAOAHMUNDRY ANDHRA PRADESH S/n.MR.RAGHAVENDRA RAO 533105"
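a = []
# Escape the dots so they match a literal '.' rather than any character
match = re.search(r'\d{2}\.\d{2}\.\d{4}', text)
date = datetime.strptime(match.group(), '%d.%m.%Y').date()
print(date)
# Collect every digit character individually
for i in text:
    if i.isdigit():
        a.append(i)
print(a)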
Output -
2013-01-22
['2', '2', '0', '1', '2', '0', '1', '3', '2', '8', '0', '5', '0', '2', '5', '6', '0', '0', '4', '8', '1', '7', '2', '1', '6', '2', '6', '5', '3', '3', '1', '0', '5']
You can find the numbers using regex
\d{2}\.\d{2}\.\d{4} this will find the date
\d+-\d+-\d+ this will find plot number
\d{3} ?\d{3} this will find pincodes
\d+ this will find all other numbers
To find underlined text you can use the run.underline property from python-docx:
import re
from docx import Document
for para in Document('test.docx').paragraphs:
    nums = re.findall(r'\d{2}\.\d{2}\.\d{4}|\d+-\d+-\d+|\d{3} ?\d{3}|\d+', para.text)
    underline_text = [run.text for run in para.runs if run.underline]
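To see how the combined pattern behaves without a .docx file, here is a small sketch that runs it on a plain string. The sample string is a shortened, hypothetical excerpt modelled on the agreement text above, and `pattern` and `tokens` are illustrative names. The order of the alternatives matters: the more specific patterns come first, so a date or pincode is returned whole instead of being broken into separate numbers.

```python
import re

# Shortened, hypothetical excerpt in the style of the agreement text.
text = ("executed at Bangalore on this 22.01.2013, AGED ABOUT 28 YRS, "
        "BANGALORE-560 048, NO.17-2-16, PIN 533105")

# date | plot number | pincode | any other number
pattern = re.compile(r'\d{2}\.\d{2}\.\d{4}|\d+-\d+-\d+|\d{3} ?\d{3}|\d+')

tokens = pattern.findall(text)
print(tokens)
```

Each token is a complete value rather than a single digit, which is usually easier to post-process than the character-by-character isdigit() approach.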

reading CSV file and inserting it into 2d list in python

I want to insert the data from a CSV file (network data such as time, IP address, port number) into a 2D list in Python.
Here is the code:
import csv
datafile = open('a.csv', 'r')
datareader = csv.reader(datafile, delimiter=';')
data = []
for row in datareader:
    data.append(row)
print (data[1:4])
The result is:
[['1', '6', '192.168.4.118', '1605', '', '115.85.145.5', '80', '', '60', '0.000000000', '0x0010', 'Jun 15, 2010 18:27:57.490835000', '0.000000000'],
['2', '6','115.85.145.5', '80', '', '192.168.4.118', '1605', '', '1514', '0.002365000', '0x0010', 'Jun 15, 2010 18:27:57.493200000', '0.002365000'],
['3', '6', '115.85.145.5', '80', '', '192.168.4.118', '1605', '', '1514', '0.003513000', '0x0018', 'Jun 15, 2010 18:27:57.496713000', '0.005878000']]
But it is just one-dimensional, and I don't know how to create a 2D array and insert each element into it.
Please suggest what code I should use for this purpose. (I looked at the previous hints on the website, but none of them worked for me.)
Here's a clean way to get a 2D array from a CSV that works with older Python versions too (pass the file name to open(), and keep the ';' delimiter from your code):
import csv
data = list(csv.reader(open('a.csv'), delimiter=';'))
print(data[1][4])
You already have a list of lists, which is a sort of 2D array, and you can address it like one: data[1][1], etc.
That is a 2D array!
You can index it like this:
data[row][value]
For example, to get the IP address of the second line in your CSV:
data[1][2]
Say your indices for ip_address, port and time are:
ip_address = 2
port = 3
time = 11
print([[item[ip_address], item[port], item[time]] for item in data])
Output:
[['192.168.4.118', '1605', 'Jun 15, 2010 18:27:57.490835000'],
['115.85.145.5', '80', 'Jun 15, 2010 18:27:57.493200000'],
['115.85.145.5', '80', 'Jun 15, 2010 18:27:57.496713000']]
You can also do this while appending rows to data itself:
for row in datareader:
    data.append([row[ip_address], row[port], row[time]])
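The indexing described in these answers can be tried on a small in-memory sample. This is only a sketch: the header names and the reduced column set are made up for illustration, and io.StringIO stands in for the real a.csv.

```python
import csv
import io

# Semicolon-delimited sample; the header names are hypothetical.
sample = (
    "no;proto;src_ip;src_port\n"
    "1;6;192.168.4.118;1605\n"
    "2;6;115.85.145.5;80\n"
)

# list(csv.reader(...)) already yields a list of lists, i.e. a 2D structure.
data = list(csv.reader(io.StringIO(sample), delimiter=";"))

# Column positions within each row.
src_ip, src_port = 2, 3

# data[row][column]: row 0 is the header, so row 1 is the first record.
print(data[1][src_ip])

# Keep only the columns of interest, skipping the header row.
selected = [[row[src_ip], row[src_port]] for row in data[1:]]
print(selected)
```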

using urllib to import formatted text file with lines out of column

I'm trying to use urllib to parse a text file from a website and pull in data. I have been able to do this with other files that are formatted in fixed-width columns, but this one is throwing me because the line for Southern Illinois-Edwardsville pushes the second score and location out of their columns.
file = urllib.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')
for line in file:
    game_month = line[0:1].rstrip()
    game_day = line[2:4].rstrip()
    game_year = line[5:9].rstrip()
    team1 = line[11:37].rstrip()
    team1_scr = line[38:40].rstrip()
    team2 = line[42:68].rstrip()
    team2_scr = line[68:70].rstrip()
    extra_info = line[72:100].rstrip()
The Southern Illinois-Edwardsville line imports 'il' as team2_scr and imports ' 4 #Central Arkansas' as the extra_info.
Wanna see the best solution? http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=CSV&submit=Fetch will give you nice CSV file, no dark magic needed.
Do you want something like this?
def get_row(row):
    row = row.split()
    # positions of the tokens that parse as integers (the two scores)
    num_pos = []
    for i in range(len(row)):
        try:
            int(row[i])
            num_pos.append(i)
        except ValueError:
            pass
    assert len(num_pos) == 2
    ans = []
    ans.append(row[0])
    ans.append("".join(row[1:num_pos[0]]))
    ans.append(int(row[num_pos[0]]))
    ans.append("".join(row[num_pos[0]+1:num_pos[1]]))
    ans.append(int(row[num_pos[1]]))
    ans.append("".join(row[num_pos[1]+1:]))
    return ans
row1 = "2/18/2011 Central Arkansas 5 Southern Illinois-Edwardsville 4 #Central Arkansas"
row2 = "2/18/2011 Central Florida 11 Siena 1 #Central Florida"
print(get_row(row1))
print(get_row(row2))
output:
['2/18/2011', 'CentralArkansas', 5, 'SouthernIllinois-Edwardsville', 4, '#CentralArkansas']
['2/18/2011', 'CentralFlorida', 11, 'Siena', 1, '#CentralFlorida']
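A variant of the same idea that keeps the spaces inside team names can be sketched with a single regular expression. This is not the answerer's code; `ROW_RE` and `parse_row` are hypothetical names, and the sketch assumes that no team name contains a standalone number, since the two integer tokens are used as the column boundaries.

```python
import re

# date, team 1, score 1, team 2, score 2, optional trailing info
ROW_RE = re.compile(r"(\S+)\s+(.+?)\s+(\d+)\s+(.+?)\s+(\d+)\s*(.*)")


def parse_row(row):
    """Split one score line into [date, team1, score1, team2, score2, extra]."""
    match = ROW_RE.fullmatch(row.strip())
    if match is None:
        raise ValueError("unrecognised row: %r" % row)
    date, team1, score1, team2, score2, extra = match.groups()
    return [date, team1, int(score1), team2, int(score2), extra]


row1 = "2/18/2011 Central Arkansas 5 Southern Illinois-Edwardsville 4 #Central Arkansas"
row2 = "2/18/2011 Central Florida 11 Siena 1 #Central Florida"
print(parse_row(row1))
print(parse_row(row2))
```

Because the pattern relies on whitespace and digits rather than fixed column offsets, a long team name that overflows its column does not shift the later fields.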
Clearly you just need to split on multiple spaces. Unfortunately the csv module only allows a single-character delimiter, but re.sub can help. I would recommend something like this:
import urllib2
import csv
import re
u = urllib2.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')
reader = csv.DictReader((re.sub(' {2,}', '\t', line) for line in u), delimiter='\t', fieldnames=('date', 'team1', 'team1_score', 'team2', 'team2_score', 'extra_info'))
for i, row in enumerate(reader):
    if i == 5: break # Only do five (otherwise you don't need ``enumerate()``)
    print row
This produces results like this:
{'team1': 'Air Force', 'team2': 'Missouri State', 'date': '2/18/2011', 'team2_score': '2', 'team1_score': '7', 'extra_info': '#neutral'}
{'team1': 'Akron', 'team2': 'Lamar', 'date': '2/18/2011', 'team2_score': '1', 'team1_score': '2', 'extra_info': '#neutral'}
{'team1': 'Alabama', 'team2': 'Alcorn State', 'date': '2/18/2011', 'team2_score': '0', 'team1_score': '11', 'extra_info': '#Alabama'}
{'team1': 'Alabama State', 'team2': 'Tuskegee', 'date': '2/18/2011', 'team2_score': '5', 'team1_score': '9', 'extra_info': '#Alabama State'}
{'team1': 'Appalachian State', 'team2': 'Maryland-Eastern Shore', 'date': '2/18/2011', 'team2_score': '0', 'team1_score': '4', 'extra_info': '#Appalachian State'}
Or if you prefer, just use a csv.reader and get lists rather than dicts:
reader = csv.reader((re.sub(' {2,}', '\t', line) for line in u), delimiter='\t')
print reader.next()
Say that s contains one row of your table. Then you could use the split() method of the re (regular expressions) library:
import re
rexp = re.compile(' {2,}') # Match two or more spaces
cols = rexp.split(s)
...and cols is now a list of strings, each a column in your table row. This assumes that table columns are separated by at least two spaces, and nothing else. If that is not the case, the argument to re.compile() can be edited to allow for other configurations.
Recall that Python considers a file a sequence of lines, separated by newline characters. Therefore, all you have to do is to for-loop over your file, applying .split() to each line.
For an even nicer solution, check out the built-in map() function and try using that instead of a for-loop.
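The split-per-line idea, including the map() suggestion, might look like the following in Python 3. This is a sketch under assumptions: the sample text is hypothetical (team names and scores borrowed from the results shown earlier), io.StringIO stands in for the downloaded file, and `split_columns` is an illustrative helper name.

```python
import io
import re

# Hypothetical fixed-width text; columns are separated by two or more spaces.
sample = io.StringIO(
    "2/18/2011  Air Force   7  Missouri State  2  #neutral\n"
    "2/18/2011  Akron       2  Lamar           1  #neutral\n"
)

two_plus_spaces = re.compile(r" {2,}")


def split_columns(line):
    # Strip the line ending first so it does not stick to the last column.
    return two_plus_spaces.split(line.rstrip())


# A file-like object yields one line per iteration, so map() can replace the for-loop.
table = list(map(split_columns, sample))
print(table)
```

Note the limitation: a row whose team name overflows its column and leaves only a single space before the score would not be split at that point.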
