Tab and line separation in python pandas

I have the attached file which I need to load in Python. I need to ignore the NETSIM and 10 values at the top and read the remaining data. I used the following code to read the file in Python:
import pandas as pd
x=pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',sep=r'\\\t',engine='python',skiprows=(0,1,2), header=None)
I used the tab separator in my code, but the output still shows up as follows:
0 0\t0.362291\t0.441396
1 1\t0.156279\t0.341383
2 2\t0.699696\t0.045577
3 3\t0.714313\t0.171668
4 4\t0.378966\t0.495494
5 5\t0.961942\t0.444337
6 6\t0.726886\t0.575888
7 7\t0.168639\t0.406223
8 8\t0.875627\t0.061439
9 9\t0.540054\t0.317061
10 5\t7\t155200000.000000\t54000000.000000\t37997...
11 3\t4\t155200000.000000\t40500000.000000\t24507...
12 4\t6\t155200000.000000\t33000000.000000\t18606...
13 5\t6\t155200000.000000\t72000000.000000\t39198...
14 4\t1\t155200000.000000\t40500000.000000\t24507...
15 3\t9\t155200000.000000\t39000000.000000\t22698...
Can someone please guide me as to what's wrong?
The attached file

You want to split on a literal tab character. With engine='python', a multi-character sep is treated as a regular expression, and the raw string r'\\\t' is a regex that matches a literal backslash followed by a tab, which never occurs in your data. Change sep to just '\t'.
x=pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',sep='\t',engine='python',skiprows=(0,1,2), header=None)
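To see why the original pattern never matches, compare the two separators with re.split (a minimal demonstration):
import re

line = '0\t0.362291\t0.441396'
print(re.split(r'\\\t', line))  # pattern = backslash + tab; never matches: ['0\t0.362291\t0.441396']
print(re.split('\t', line))     # splits on real tabs: ['0', '0.362291', '0.441396']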

Related

Access skipped lines from a pandas import

When importing a file into pandas, there are several options for skipping header lines, specifying which line should become the column headers, and so on.
Is it possible to access those skipped lines for later use? In my case, some of the skipped lines contain useful metadata which isn't part of the 'proper' table but which I'd still like to use.
For example, say I have this data (it's already in dataframe format from a previous script):
/path/to/the/originating/file
Position Start End Peptide Chou-Fasman Emini Kolaskar-Tongaonkar Parker
0 5 1 9 MSTSTSQIA 0.991 0.887 0.990 2.867
1 6 2 10 STSTSQIAV 0.980 0.665 1.052 2.922
2 7 3 11 TSTSQIAVE 0.903 0.860 1.034 3.067
3 8 4 12 STSQIAVEY 0.923 0.934 1.062 2.278
4 9 5 13 TSQIAVEYP 0.933 1.077 1.068 1.789
...
I can read this in like so:
df = pd.read_csv(sys.stdin, delim_whitespace=True, header=1, index_col=0)
The header=1 takes care of skipping the filepath line, but I'd still like to know what the file was. As you can see, I'm using this in a pipe, so that line allows me to 'track' the contents and the file they came from.
I'd still like to be able to access (in this case) the zero-th line though to do other things with it in the script. More generally, is there a way to access lines that pandas has skipped?
Since this is a pipe, reading the file again to get just that line isn't really an option. I could read the stream into a StringIO object, get the first line, then pass the rest of it to pandas, but that feels hacky.
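For reference, here's a sketch of the kind of thing I mean, just consuming the first line myself before handing the stream to pandas:
import sys
import pandas as pd

# grab the metadata line before pandas sees the stream
source_path = sys.stdin.readline().strip()
# the stream now starts at the header row, so the default header=0 applies
df = pd.read_csv(sys.stdin, delim_whitespace=True, index_col=0)
print(source_path)  # e.g. /path/to/the/originating/file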

Read Delimited File That Wraps Lines

I apologize if there is an obvious answer to this already.
I have a very large file that poses a few challenges for parsing. I am delivered these files from outside my organization, so there is no chance I can change their format.
Firstly, the file is space delimited but the fields that represent a "column" of data can span multiple rows. For example, if you had a row that was supposed to be 25 columns of data, it may be written in the file as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
As you can see, I can't rely on each set of data being on the same line, but I can rely on there being the same number of columns per set.
To make matters worse, the file follows a definition:data format where the first 3 or so lines describe the data (including a field that tells me how many rows there are) and the next N rows are data. Then it goes back to the 3-line format to describe the next set of data, which means I can't just set up a reader for the N-column format and let it run to EOF.
I'm afraid the built-in Python file-reading functionality could get really ugly real fast, but I can't find anything in csv or numpy that works.
Any suggestions?
EDIT: Just as an example of a different solution:
We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns so we do something like:
data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);
This would read the data no matter how it wrapped while leaving a file handle open to process the next section later. This is done because the files are so large they can lead to excess RAM usage.
This is a sketch of how you can proceed:
(EDIT: with some modifications)
Note that the two lines extracting nrows and ncols are placeholder assumptions; how they are parsed depends on your description-line format.
f = open("testfile.txt", "r")
# store data for the different sections here
datasections = list()
while True:
    current_row = []
    # read the three description lines
    l1 = f.readline()
    if l1 == '':  # end of file, or other end condition
        break
    l2 = f.readline()
    l3 = f.readline()
    # extract the following information from l1, l2, l3
    # (placeholder assumption: each count is the first field of its line)
    nrows = int(l1.split()[0])  # the number of rows in the next section
    ncols = int(l2.split()[0])  # the number of columns in the next section
    # read lines and collect items until the section is complete,
    # no matter how the rows are wrapped
    while len(current_row) < nrows * ncols:
        current_row.extend(f.readline().split())
    # break current_row into rows after each ncols-th item
    rows = [current_row[i:i + ncols] for i in range(0, nrows * ncols, ncols)]
    # store the data for this section
    datasections.append(rows)
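Since the files are large enough that RAM is a concern, each completed section could be stored as a NumPy array instead of the nested lists in the last two lines above (a sketch, assuming all fields are numeric):
import numpy as np

# reshape the flat item list into an (nrows, ncols) float array
section = np.array(current_row[:nrows * ncols], dtype=float).reshape(nrows, ncols)
datasections.append(section)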

Why does a Pandas Series created from a list appear enclosed in square brackets?

I'm trying to create a Series in Pandas from a list of dates presented as strings, thus:
['2016-08-09',
'2015-08-03',
'2017-08-15',
'2017-12-14',
...
but when I apply pd.Series from within the Pandas module the result in Jupyter notebook displays as:
0 [[[2016-08-09]]]
1 [[[2015-08-03]]]
2 [[[2017-08-15]]]
3 [[[2017-12-14]]]
...
Is there a simple way to fix it? The data has come from an XML feed parsed using lxml.objectify.
I don't normally get these problems when reading from csv and am just curious what I might be doing wrong.
UPDATE:
The code to grab the data and an example site:
import lxml.objectify
import pandas as pd

def parse_sitemap(url):
    root = lxml.objectify.parse(url)
    rooted = root.getroot()
    output_1 = [child.getchildren()[0] for child in rooted.getchildren()]
    output_0 = [child.getchildren()[1] for child in rooted.getchildren()]
    return output_1

results = parse_sitemap("sitemap.xml")
pd.Series(results)
If you print out type(results[0]), you'll see that what you're getting is not a string:
print(type(results[0]))
Output:
lxml.objectify.StringElement
This is not a string, and pandas doesn't seem to play nicely with it. But the fix is easy: just convert to string using pd.Series.astype:
s = pd.Series(results).astype(str)
print(s)
0 2017-08-09T11:20:38Z
1 2017-08-09T11:10:55Z
2 2017-08-09T15:36:20Z
3 2017-08-09T16:36:59Z
4 2017-08-02T09:56:50Z
5 2017-08-02T19:33:31Z
6 2017-08-03T07:32:24Z
7 2017-08-03T07:35:35Z
8 2017-08-03T07:54:12Z
9 2017-07-31T16:38:34Z
10 2017-07-31T15:42:24Z
11 2017-07-31T15:44:56Z
12 2017-07-31T15:23:25Z
13 2017-08-01T08:30:27Z
14 2017-08-01T11:01:57Z
15 2017-08-03T13:52:39Z
16 2017-08-03T14:29:55Z
17 2017-08-03T13:39:24Z
18 2017-08-03T13:39:00Z
19 2017-08-03T15:30:58Z
20 2017-08-06T11:29:24Z
21 2017-08-03T10:19:43Z
22 2017-08-14T18:42:49Z
23 2017-08-15T15:42:04Z
24 2017-08-17T08:58:19Z
25 2017-08-18T13:37:52Z
26 2017-08-18T13:38:14Z
27 2017-08-18T13:45:42Z
28 2017-08-03T09:56:42Z
29 2017-08-01T11:01:22Z
dtype: object
I think all you need to do is:
pd.Series(dates)
but there's not enough info in the question to say for sure.
Additionally, if you want datetime64 values, you can do:
pd.Series(pd.to_datetime(dates))
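Note that with the lxml.objectify elements from the question, you would most likely still need the astype(str) conversion first; a sketch combining the two answers:
# convert the StringElements to str, then parse them as datetime64
s = pd.to_datetime(pd.Series(results).astype(str))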

parse text file and generate new .csv file based on that data

I would like to parse a machine log file, re-arrange the data, and write it to a .csv file which I will import into a Google spreadsheet. Or write the data directly to the spreadsheet.
Here is an example of what the log looks like:
39 14 15 5 2016 39 14 15 5 2016 0
39 14 15 5 2016 40 14 15 5 2016 0.609
43 14 15 5 2016 44 14 15 5 2016 2.182
the output should look like this:
start_date,start_time,end_time,meters
15/5/16,14:39,14:39,0
15/5/16,14:39,14:40,0.609
15/5/16,14:43,14:44,2.182
I wrote the following Python code:
file = open("c:\SteelUsage.bsu")
for line in file.readlines():
print(line) #just for verification
line=line.strip()
position=[]
numbers=line.split()
for number in numbers:
position.append(number)
print(number)#just for verification
The idea is to save each number in a row to a list, then I can re-write the numbers in the right order according to their position.
For example: in row #1 the string "39" will have position 0, "14" position 1, etc.
But it seems the code I wrote stores each number as a new list, because when I change print(number) to print(number[0]), the code prints the first digit of each number instead of printing the first number (39).
Where did I go wrong?
Thank you
Nothing is being stored as a nested list: number is a string, so number[0] is just its first character. Index into the split list to reorder the fields. Do something like the following, then write it out to your csv file.
with open('c:\SteelUsage.bsu', 'r') as reader:
    lines = reader.readlines()
    for line in lines:
        inp = line.strip().split()
        out = '%s/%s/%s,%s:%s,%s:%s,%s' % (inp[2], inp[3], inp[4], inp[1], inp[0], inp[6], inp[5], inp[10])
        print(out)
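To actually write the .csv instead of printing, here's a sketch using the csv module (the output filename is made up):
import csv

with open(r'c:\SteelUsage.bsu') as reader, open('steel_usage.csv', 'w', newline='') as target:
    writer = csv.writer(target)
    writer.writerow(['start_date', 'start_time', 'end_time', 'meters'])
    for line in reader:
        inp = line.strip().split()
        if not inp:
            continue  # skip any blank lines
        writer.writerow(['%s/%s/%s' % (inp[2], inp[3], inp[4]),
                         '%s:%s' % (inp[1], inp[0]),
                         '%s:%s' % (inp[6], inp[5]),
                         inp[10]])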

How do I create a pandas dataframe in python from a csv with additional delimiters?

I have a large csv (on the order of 400k rows) which I wish to turn into a dataframe in python. The original file has two columns: a text column, followed by an int (or NAN) column.
Example:
...
P-X1-6030-07-A01 368963
P-X1-6030-08-A01 368964
P-X1-6030-09-A01 368965
P-A-1-1011-14-G-01 368967
P-A-1-1014-01-G-05 368968
P-A-1-1017-02-D-01 368969
...
I wish to additionally split the text column into a series of columns following the pattern of the last three lines of the example text (P A 1 1017 02 D 01 368969, for example)
Noting that the text field can have varying formatting (P-X1 vs P-X-1), how might this best be accomplished?
First Attempt
The spec for read_csv indicates that sep takes a regular expression, but this appears to be incorrect. After inspecting the source, it appears to merely take a series of characters that it uses to populate a character set followed by +, so the sep arguments below would be used to create a regex like `[- ]+`.
Import necessary libs to recreate:
import pandas as pd
import StringIO
You can use a set of characters as delimiters. Parsing the mismatched rows together isn't possible with pd.read_csv, but if you want to parse them separately:
pd.read_csv(StringIO.StringIO('''P-X1-6030-07-A01 368963
P-X1-6030-08-A01 368964
P-X1-6030-09-A01 368965'''), sep=r'- ') # sep arg becomes regex, i.e. `[- ]+`
and
pd.read_csv(StringIO.StringIO('''P-A-1-1011-14-G-01 368967
P-A-1-1014-01-G-05 368968
P-A-1-1017-02-D-01 368969'''), sep=r'- ')
But read_csv is apparently unable to use real regular expressions for the separator.
Final Solution
That means we'll need a custom solution:
import re
import StringIO
import pandas as pd
txt = '''P-X1-6030-07-A01 368963
P-X1-6030-08-A01 368964
P-X1-6030-09-A01 368965
P-A-1-1011-14-G-01 368967
P-A-1-1014-01-G-05 368968
P-A-1-1017-02-D-01 368969'''
fileobj = StringIO.StringIO(txt)
def df_from_file(fileobj):
    '''
    takes a file object, returns DataFrame with columns grouped by
    contiguous runs of either letters or numbers (but not both together)
    '''
    # unfortunately, we must materialize the data before putting it in the DataFrame
    gen_records = [re.findall(r'(\d+|[A-Z]+)', line) for line in fileobj]
    return pd.DataFrame.from_records(gen_records)
df = df_from_file(fileobj)
and now df returns:
0 1 2 3 4 5 6 7
0 P X 1 6030 07 A 01 368963
1 P X 1 6030 08 A 01 368964
2 P X 1 6030 09 A 01 368965
3 P A 1 1011 14 G 01 368967
4 P A 1 1014 01 G 05 368968
5 P A 1 1017 02 D 01 368969
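If you want meaningful column names afterwards, you can assign them directly. The names below are made-up placeholders, since the question doesn't say what the fields represent:
# hypothetical column names, for illustration only
df.columns = ['prefix', 'site', 'unit', 'batch', 'pos', 'plate_row', 'plate_col', 'id']
df['id'] = df['id'].astype(int)  # restore the original int column's dtype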
