When importing a file into pandas, there are several options for skipping header lines, specifying which line should become the column headers, and so on.
Is it possible to access these skipped lines for later use? In my case, some of those skipped lines contain useful metadata that isn't part of the 'proper' table, but which I'd still like to use.
For example, say I have this data (it's already in dataframe format from a previous script):
/path/to/the/originating/file
Position Start End Peptide Chou-Fasman Emini Kolaskar-Tongaonkar Parker
0 5 1 9 MSTSTSQIA 0.991 0.887 0.990 2.867
1 6 2 10 STSTSQIAV 0.980 0.665 1.052 2.922
2 7 3 11 TSTSQIAVE 0.903 0.860 1.034 3.067
3 8 4 12 STSQIAVEY 0.923 0.934 1.062 2.278
4 9 5 13 TSQIAVEYP 0.933 1.077 1.068 1.789
...
I can read this in like so:
df = pd.read_csv(sys.stdin, delim_whitespace=True, header=1, index_col=0)
The header=1 takes care of skipping the filepath line, but I'd still like to know what the file was. As you can see, I'm using this in a pipe, so that line lets me 'track' the contents and the file they came from.
I'd still like to be able to access the zeroth line (in this case) to do other things with it in the script. More generally, is there a way to access lines that pandas has skipped?
Since this is a pipe, reading the file again to get just that line isn't really an option. I could read the stream into a StringIO object, get the first line, then pass the rest of it to pandas, but that feels hacky.
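For what it's worth, a minimal sketch of that first-line-then-pandas idea, skipping the intermediate StringIO entirely (whether this is any less hacky is debatable):
import sys
import pandas as pd

# pull the metadata line off the stream before pandas sees it
filepath = sys.stdin.readline().strip()

# header=0 now, since the filepath line has already been consumed
df = pd.read_csv(sys.stdin, delim_whitespace=True, header=0, index_col=0)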
Related
I am loading a CSV file in order to read it into a dataframe, but when I load it, I get a dataframe that has not been separated into columns.
The data in the CSV file is as below:
Name;OffsetBarrierX;MeanF;MaxF;Thickness
sim1;88.416;-120140.0402;453487.6875;2.4
sim2;-108.321;-119949.118;447437.9688;2.4
sim3;-89.221;-119324.1906;576758.625;2.4
sim4;-89.221;-119324.1906;576758.625;2.4
sim5;-87.488;-117527.1688;574904.5625;2.4
sim6;-11.1424;-183188.6846;763354;2.4
I wrote the following line of code:
path = '../00_Data/Res_Overl_A.csv'
pd.read_csv(path, sep = ';')
I got the following output:
Name;OffsetBarrierX;MeanF;MaxF;Thickness
0 sim1;88.416;-120140.0402;453487.6875;2.4
1 sim2;-108.321;-119949.118;447437.9688;2.4
2 sim3;-89.221;-119324.1906;576758.625;2.4
3 sim4;-87.488;-117527.1688;574904.5625;2.4
4 sim5;-11.1424;-183188.6846;763354;2.4
5 sim6;-82.713;-121320.2933;608878.1875;2.4
6 sim7;-27.194;-172708.102;747944.625;2.4
So as you can see, there is no change. I need an output where each value goes into its own dataframe column. How can I get a separate column for every field?
Couldn't reproduce OP's problem: when I create a CSV with the data indicated above, pandas.read_csv reads it without any issue:
path = r'C:\Users\johndoe\Documents\Python\Challenges\StackOverflow\0108.csv'
df = pd.read_csv(path, sep=';')
[Out]:
Name OffsetBarrierX MeanF MaxF Thickness
0 sim1 88.4160 -120140.0402 453487.6875 2.4
1 sim2 -108.3210 -119949.1180 447437.9688 2.4
2 sim3 -89.2210 -119324.1906 576758.6250 2.4
3 sim4 -89.2210 -119324.1906 576758.6250 2.4
4 sim5 -87.4880 -117527.1688 574904.5625 2.4
5 sim6 -11.1424 -183188.6846 763354.0000 2.4
However, given OP's problem, one approach would be to store the content of the CSV in a variable. Let's call the variable csv
csv = """Name;OffsetBarrierX;MeanF;MaxF;Thickness
sim1;88.416;-120140.0402;453487.6875;2.4
sim2;-108.321;-119949.118;447437.9688;2.4
sim3;-89.221;-119324.1906;576758.625;2.4
sim4;-89.221;-119324.1906;576758.625;2.4
sim5;-87.488;-117527.1688;574904.5625;2.4
sim6;-11.1424;-183188.6846;763354;2.4"""
Now, StringIO should solve the problem
from io import StringIO
df = pd.read_csv(StringIO(csv), sep=';')
[Out]:
Name OffsetBarrierX MeanF MaxF Thickness
0 sim1 88.4160 -120140.0402 453487.6875 2.4
1 sim2 -108.3210 -119949.1180 447437.9688 2.4
2 sim3 -89.2210 -119324.1906 576758.6250 2.4
3 sim4 -89.2210 -119324.1906 576758.6250 2.4
4 sim5 -87.4880 -117527.1688 574904.5625 2.4
5 sim6 -11.1424 -183188.6846 763354.0000 2.4
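If the columns still refuse to split on the real file, it can help to check what the first raw line actually contains (a quick diagnostic sketch; repr exposes the true delimiter, a BOM, or stray quoting):
with open(path) as f:
    print(repr(f.readline()))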
Given a set of data that looks like the following, each line is 10 characters in length. The lines are links of a network, composed of combinations of 4- or 5-character node numbers. Below is an example of the situations I would face:
|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|
The dataset doesn't care much about spacing: while lines 1, 2 and 4 can be read into Pandas easily with either sep=' ' or delim_whitespace=True, I'm afraid I can't do the same for line 3. There is very little I can do to the input data file, as it's generated by third-party software
(apart from doing some formatting in Excel, which seemed counterintuitive...). Is there something in Pandas that allows me to specify the number of characters (in my case, 5) as a delimiter?
Advice much appreciated.
I think what you're looking for is pd.read_fwf to read a fixed-width file. In this case, you would specify column specifications:
import io
import pandas as pd

pd.read_fwf(io.StringIO('''|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|'''), colspecs=[(1, 6), (6, 11)], header=None)
The column specifications are 0-indexed and end-exclusive. You could also use the widths parameter, but I would avoid using it before stripping the | out to ensure that your variables are properly read in as numbers rather than as strings starting or ending with a pipe.
This would, in this case, produce:
0 1
0 10637 4652
1 1038 1037
2 70612 19637
3 82004 2082
I passed header=None due to the lack of a header in your sample data; adjust as needed. I also stripped out all the blank lines in your input. If there are in fact blank lines in the input, I would first run '\n'.join(s for s in input_string.split('\n') if len(s.strip()) != 0) before passing the text to be parsed. In that case, you would first load the file as a string, clean it, and then pass it with io.StringIO to read_fwf, as sketched below.
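Putting those pieces together, the load-clean-parse pipeline might look like this (a sketch; the filename links.txt is just a placeholder):
import io
import pandas as pd

with open('links.txt') as f:
    input_string = f.read()
# drop any blank lines before handing the text to the fixed-width parser
cleaned = '\n'.join(s for s in input_string.split('\n') if len(s.strip()) != 0)
df = pd.read_fwf(io.StringIO(cleaned), colspecs=[(1, 6), (6, 11)], header=None)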
With read_csv, you can specify the sep as a group of 4 or 5 digits, then keep only the columns with the numbers.
from io import StringIO

import pandas as pd

s = '''
|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|
'''
print(
    pd.read_csv(StringIO(s), sep=r'(\d{4,5})',
                engine='python', usecols=[1, 3],
                index_col=False, header=None)
)
1 3
0 10637 4652
1 1038 1037
2 70612 19637
3 82004 2082
Or you can just load the data and take advantage of the textwrap module: specify the width, and it'll generate the columns for you.
import textwrap
df['<col_name>'].apply(textwrap.wrap, width = 5).apply(pd.Series)
OUTPUT:
0 1
0 10637 4652
1 1038 1037
2 70612 19637
3 82004 2082
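A self-contained version of that idea (a sketch: reading everything in as a single string column first is an assumption here, and s is inlined for the example):
import io
import textwrap

import pandas as pd

s = '''|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|'''
# read each line as one string, strip the pipes, then let textwrap split at width 5
raw = pd.read_csv(io.StringIO(s), header=None)[0].str.strip('|')
df = raw.apply(textwrap.wrap, width=5).apply(pd.Series)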
I would just use df['my_col'].str[0:6] after reading it in as one column.
If your file is just this data, then @ifly6's use of read_fwf is more appropriate. I just assumed that this is one column as part of a larger CSV, in which case this is the approach I would use.
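A sketch of that slicing approach (assuming the pipes are still attached, so the slices below skip over them):
import io
import pandas as pd

s = '|10637 4652|\n| 1038 1037|\n|7061219637|\n|82004 2082|'
df = pd.read_csv(io.StringIO(s), header=None, names=['my_col'])
df['start'] = df['my_col'].str[1:6].str.strip().astype(int)
df['end'] = df['my_col'].str[6:11].str.strip().astype(int)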
I have the attached file, which I need to load in Python. I need to ignore the NETSIM and 10 values at the top and read the remaining lines. I used the following code to read the file in Python:
import pandas as pd
x=pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',sep=r'\\\t',engine='python',skiprows=(0,1,2), header=None)
I used the tab separator in my code, but the output still shows me the following:
0 0\t0.362291\t0.441396
1 1\t0.156279\t0.341383
2 2\t0.699696\t0.045577
3 3\t0.714313\t0.171668
4 4\t0.378966\t0.495494
5 5\t0.961942\t0.444337
6 6\t0.726886\t0.575888
7 7\t0.168639\t0.406223
8 8\t0.875627\t0.061439
9 9\t0.540054\t0.317061
10 5\t7\t155200000.000000\t54000000.000000\t37997...
11 3\t4\t155200000.000000\t40500000.000000\t24507...
12 4\t6\t155200000.000000\t33000000.000000\t18606...
13 5\t6\t155200000.000000\t72000000.000000\t39198...
14 4\t1\t155200000.000000\t40500000.000000\t24507...
15 3\t9\t155200000.000000\t39000000.000000\t22698...
Can someone please guide me as to what's wrong?
The attached file
You want to split on a literal tab character, not the string \\t, so you shouldn't be using a raw string literal here. Change sep to just '\t'.
x=pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',sep='\t',engine='python',skiprows=(0,1,2), header=None)
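As a side note, the engine='python' argument isn't needed once sep is a plain single character; the default C engine handles '\t' (and is faster), so a sketch like this should behave the same:
x = pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',
                sep='\t', skiprows=(0, 1, 2), header=None)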
I apologize if there is an obvious answer to this already.
I have a very large file that poses a few challenges for parsing. I am delivered these files from outside my organization, so there is no chance I can change their format.
Firstly, the file is space delimited, but the fields that make up one logical row of data can span multiple physical lines. For example, a row that is supposed to be 25 columns of data may be written in the file as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
As you can see, I can't rely on each set of data being on the same line, but I can rely on there being the same number of columns per set.
To make matters worse, the file follows a definition:data format, where the first 3 or so lines describe the data (including a field that tells me how many rows there are) and the next N rows are data. Then it goes back to the 3-line format again to describe the next set of data. That means I can't just set up a reader for the N-column format and let it run to EOF.
I'm afraid the built-in Python file reading functionality could get really ugly really fast, but I can't find anything in csv or numpy that works.
Any suggestions?
EDIT: Just as an example of a different solution:
We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns so we do something like:
data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);
This would read the data no matter how it wrapped while leaving a file handle open to process the next section later. This is done because the files are so large they can lead to excess RAM usage.
This is a sketch of how you can proceed:
(EDIT: with some modifications)
file = open("testfile.txt", "r")
# store data for the different sections here
datasections = list()
while True:
    # read the three header lines that describe the next section
    l1 = file.readline()
    if l1 == '':  # EOF, or other end condition
        break
    l2 = file.readline()
    l3 = file.readline()
    # extract the section shape from l1, l2, l3 -- placeholder parsing,
    # adapt it to however your files actually encode these counts
    nrows = int(l1.split()[-1])  # the number of rows in the next section
    ncols = int(l2.split()[-1])  # the number of columns in the next section
    # read lines until nrows * ncols items are collected, no matter
    # how the rows are wrapped; isolate the items with str.split()
    current_row = []
    while len(current_row) < nrows * ncols:
        current_row.extend(file.readline().split())
    # break current_row into rows after each ncols-th item and store
    # the section as a new list of rows
    datasections.append([current_row[i:i + ncols]
                         for i in range(0, len(current_row), ncols)])
file.close()
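To mirror what the MATLAB textscan call produced, each stored section can then be converted to a numeric array (a sketch, assuming the fields are floats as in the %f format above):
import numpy as np

arrays = [np.array(section, dtype=float) for section in datasections]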
I have data in a text file and I would like to be able to modify the file by columns and output the file again. I normally write in C (basic ability) but chose Python for its obvious string benefits. I haven't ever used Python before, so I'm a tad stuck. I have been reading up on similar problems, but they only show how to change whole lines. To be honest, I have no clue what to do.
Say I have the file
1 2 3
4 5 6
7 8 9
and I want to be able to change column two with some function, say multiply it by 2, so I get
1 4 3
4 10 6
7 16 9
Ideally I would be able to easily change the program so I can apply any function to any column.
For anyone who is interested, it is for modifying lab data for plotting, e.g. taking the log of the first column.
Python is an excellent general-purpose language; however, if you are on a Unix-based system, I might suggest you take a look at awk. The awk language is designed for exactly this kind of text-based transformation, and its power is easy to see for your question, as the solution is only a few characters: awk '{$2=$2*2;print}'.
$ cat file
1 2 3
4 5 6
7 8 9
$ awk '{$2=$2*2;print}' file
1 4 3
4 10 6
7 16 9
# Multiply the third column by 10
$ awk '{$3=$3*10;print}' file
1 2 30
4 5 60
7 8 90
In awk, each column is referenced by $i, where i is the ith field. So we just set the value of the second field to the value of the second field multiplied by two, and print the line. This can be written even more concisely as awk '{$2=$2*2}1' file, but it's best to be clear at the beginning.
Here is a very simple Python solution:
with open("myfile.txt") as f:
    for line in f:
        col = line.strip().split(' ')
        print(col[0], int(col[1]) * 2, col[2])
There are plenty of improvements that could be made, but I'll leave those as an exercise for you.
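Since OP mentioned taking the log of the first column, the same pattern adapts easily (a sketch):
import math

with open("myfile.txt") as f:
    for line in f:
        col = line.strip().split(' ')
        print(math.log(float(col[0])), col[1], col[2])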
I would use pandas or just numpy. Read your file with:
import pandas as pd

data = pd.read_csv('file.txt', header=None, delim_whitespace=True)
then work with the data in a spreadsheet-like style, e.g.:
data[1] *= 2
finally, write it back out with:
data.to_csv('output.txt', sep=' ', header=False, index=False)
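And since the goal is to apply any function to any column, the same indexing generalizes, e.g. with numpy's log (a sketch; with header=None, the columns are simply labeled 0, 1, 2):
import numpy as np

data[0] = np.log(data[0])  # take the log of the first column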
As @sudo_O said, there are more efficient tools than Python for this task. However, here is a possible solution:
from itertools import repeat
import csv

fun = pow
with open('m.in', 'r') as input_file, open('m.out', 'w', newline='') as out_file:
    inpt = csv.reader(input_file, delimiter=' ')
    out = csv.writer(out_file, delimiter=' ')
    for row in inpt:
        row = [int(e) for e in row]  # convert the fields to integers
        opt = repeat(2, len(row))    # exponent 2 (square) for every value
        # write function(value, argument) for each value in the row
        out.writerow([str(elem) for elem in map(fun, row, opt)])
Here it multiplies every number by itself, but you can configure it to square only the second column by changing opt: opt = [1 + (col == 1) for col in range(len(row))] (exponent 2 for column 1, exponent 1 otherwise).