Given a set of data that looks like the following, each line is 10 characters in length. The lines are links of a network, each a combination of two 4- or 5-character node numbers. Below is an example of the situations I would face:
|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|
As the dataset doesn't care much about spacing, lines 1, 2 and 4 can be read in Pandas easily with either sep=' ' or delim_whitespace=True, but I'm afraid I can't do the same for line 3. There is very little I can do to the input data file, as it's generated by third-party software
(apart from doing some formatting in Excel, which seemed counterintuitive...). Is there something in Pandas that allows me to specify a number of characters (in my case, 5) as the delimiter?
Advice much appreciated.
I think what you're looking for is pd.read_fwf, which reads a fixed-width file. In this case, you would pass column specifications:
import io
import pandas as pd

pd.read_fwf(io.StringIO('''|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|'''), colspecs=[(1, 6), (6, 11)], header=None)
The column specifications are 0-indexed and end-exclusive. You could also use the widths parameter, but I would strip the | out first, to ensure that your variables are properly read in as numbers rather than as strings starting or ending with a pipe (a sketch of that route follows at the end of this answer).
This would, in this case, produce:
0 1
0 10637 4652
1 1038 1037
2 70612 19637
3 82004 2082
I passed header=None because your sample data has no header; adjust as needed. I also stripped out all the blank lines in your input. If there are in fact blank lines in the input, I would first run '\n'.join(s for s in input_string.split('\n') if len(s.strip()) != 0) to drop them: load the file as a string, clean it, and then pass it via io.StringIO to read_fwf.
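For completeness, here is a sketch of the widths route, stripping the pipes first; links.txt is a hypothetical file name:

import io
import pandas as pd

# strip the surrounding pipes, then read two fixed-width fields of 5 characters
with open('links.txt') as fh:  # hypothetical file name
    cleaned = ''.join(line.strip().strip('|') + '\n' for line in fh)

df = pd.read_fwf(io.StringIO(cleaned), widths=[5, 5], header=None)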
With read_csv, you can specify sep as a regex capturing a group of 4 or 5 digits; the captured groups become columns, and you then keep only the columns holding the numbers.
import pandas as pd
from io import StringIO

s = '''
|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|
'''

print(
    pd.read_csv(StringIO(s), sep=r'(\d{4,5})',
                engine='python', usecols=[1, 3],
                index_col=False, header=None)
)
1 3
0 10637 4652
1 1038 1037
2 70612 19637
3 82004 2082
Or you can just load the data and take advantage of the textwrap module: just specify the width and it'll generate the columns for you.
import textwrap

# cut each value into 5-character chunks; the pipes must already be stripped
df['<col_name>'].apply(textwrap.wrap, width=5).apply(pd.Series)
OUTPUT:
0 1
0 10637 4652
1 1038 1037
2 70612 19637
3 82004 2082
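A fuller sketch of that idea, assuming the raw lines still carry the pipes and sit in a hypothetical links.txt:

import textwrap
import pandas as pd

# read each raw line as a single string column ('links.txt' is a hypothetical file name)
raw = pd.read_csv('links.txt', header=None, dtype=str)
nodes = raw[0].str.strip('|')  # drop the surrounding pipes
# textwrap breaks on whitespace where present and at width=5 where not
df = nodes.apply(textwrap.wrap, width=5).apply(pd.Series)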
I would just use df['my_col'].str[0:6] after reading it in as one column.
If your file is just this data, then #ifly6's use of read_fwf is more appropriate. I assumed that this is one column as part of a larger CSV, in which case this is the approach I would use.
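A minimal sketch of that slicing (the column name my_col is hypothetical; here I assume the pipes have already been stripped, so the two fields sit at positions 0:5 and 5:10):

import pandas as pd

# a single 10-character link column inside a larger frame
df = pd.DataFrame({'my_col': ['10637 4652', ' 1038 1037', '7061219637']})
links = pd.DataFrame({
    'a': df['my_col'].str[0:5].astype(int),   # first five characters: first node
    'b': df['my_col'].str[5:10].astype(int),  # last five characters: second node
})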
Related to, but distinct from, this question.
I want to output my pandas dataframe to a tsv file. The first column of my data is a pattern that actually contains 3 bits of information which I'd like to separate into their own columns:
Range c1
chr1:2953-2965 -0.001069
chr1:35397-35409 -0.001050
chr1:37454-37466 -0.001330
chr2:37997-38009 -0.001235
chrX:44465-44477 -0.001292
So I do this:
Df = Df.reset_index()
Df["Range"] = Df["Range"].str.replace(":", "\t").str.replace("-", "\t")
Df
Range c1
0 chr1\t2953\t2965 -0.001069
1 chr1\t35397\t35409 -0.001050
2 chr1\t37454\t37466 -0.001330
3 chr2\t37997\t38009 -0.001235
4 chrX\t44465\t44477 -0.001292
All I need to do now is output with no header or index, and add one more '\t' to separate the last column and I'll have my 4-column output file as desired. Unfortunately...
Df.to_csv( "~/testout.bed",
header=None,
index=False,
sep="\t",
quoting=csv.QUOTE_NONE,
quotechar=""
)
Error: need to escape, but no escapechar set
Here is where I want to ignore this error and say: "No, Python, actually you don't need to escape anything. I put those tab characters in there specifically to create column separators."
I get why this error occurs: Python thinks I forgot about those tabs, and this is a safety catch. But I didn't forget anything, and I know what I'm doing. The tab characters in my data will be indistinguishable from column separators, and that's exactly what I want; I put them there for precisely this reason.
Surely there must be some way to override this, no? Is there any way to ignore the error and force the output?
You can use str.split to split the Range column directly:
df['Range'].str.split(r":|-", expand=True)
# 0 1 2
#0 chr1 2953 2965
#1 chr1 35397 35409
#2 chr1 37454 37466
#3 chr2 37997 38009
#4 chrX 44465 44477
To retain all the columns, you can join this split with the original:
df = df.join(df['Range'].str.split(r":|-", expand=True))
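Once the pattern is expanded into real columns, the tab-separated write-out from the question needs no escaping at all. A minimal sketch, keeping the question's column order:

# split into three real columns, append c1, and write a plain 4-column TSV
parts = df['Range'].str.split(r':|-', expand=True)
out = parts.join(df['c1'])
out.to_csv('~/testout.bed', header=False, index=False, sep='\t')

No quoting or escaping arguments are needed, because no field contains a tab anymore.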
I'm trying to read a CSV and split the data into 2 columns. I went with some regex separators like (?<=_.*)\s+, but Python returns "re.error: look-behind requires fixed-width pattern". Other variants such as \s+(?![^_\S+]) give more than 2 columns.
Could someone help me find a solution?
pd.read_csv('out.txt', header=None, sep=r"(?<=_.*)\s+", skiprows=2,
engine='python', keep_default_na=False)
_journal_issue 23
_journal_name_full 'Physical Chemistry and Chemical Physics'
_journal_page_first 9197
_journal_page_last 9204
_journal_paper_doi 10.1039/c3cp50853f
_journal_volume 15
_journal_year 2013
_chemical_compound_source 'Corrosion product'
_chemical_formula_structural 'Fe3 O4'
_chemical_formula_sum 'Fe3 O4'
_chemical_name_mineral Magnetite
_chemical_name_systematic 'Iron diiron(III) oxide'
_space_group_crystal_system cubic
_space_group_IT_number 227
_space_group_name_Hall 'F 4d 2 3 -1d'
_space_group_name_H-M_alt 'F d -3 m :1'
_cell_angle_alpha 90
_cell_angle_beta 90
_cell_angle_gamma 90
_cell_formula_units_Z 8
_cell_length_a 8.36
_cell_length_b 8.36
_cell_length_c 8.36
_raman_determination_method experimental
_[local]_chemical_compound_color Black
_[local]_chemical_compound_state Solid
_raman_measurement_device.location 'IMMM Maine university'
_raman_measurement_device.company 'HORIBA Jobin Yvon'
_raman_measurement_device.model T64000
_raman_measurement_device.optics_type microscope
_raman_measurement_device.microscope_system dispersive
_raman_measurement_device.microscope_objective_magnification 100
_raman_measurement_device.microscope_numerical_aperture 0.90
_raman_measurement_device.excitation_laser_type Argon-Krypton
_raman_measurement_device.excitation_laser_wavelength 514
_raman_measurement_device.configuration simple
_raman_measurement_device.resolution 3
_raman_measurement_device.power_on_sample 2
_raman_measurement_device.direction_polarization unoriented
_raman_measurement_device.spot_size 0.8
_raman_measurement_device.diffraction_grating 600
_raman_measurement.environment air
_raman_measurement.environment_details
_raman_measurement.temperature 300
_raman_measurement.pressure 100
_raman_measurement.background_subtraction no
_raman_measurement.background_subtraction_details
_raman_measurement.baseline_correction no
_raman_measurement.baseline_correction_details
Try
df = pd.read_csv(
"out.txt", header=None, delim_whitespace=True, quotechar="'", keep_default_na=False
)
Result for your sample:
0 1
0 _journal_issue 23
1 _journal_name_full Physical Chemistry and Chemical Physics
2 _journal_page_first 9197
...
46 _raman_measurement.background_subtraction_details
47 _raman_measurement.baseline_correction no
48 _raman_measurement.baseline_correction_details
As per the pandas documentation for pd.read_csv, you can provide sep only as a string, and a string like r"<String>" is just a raw string literal. What I would recommend is to first loop through the file, replace all delimiters with one common delimiter, and then feed the file to read_csv.
Apparently, the above answer is not true. jjramsey, in the comment below, has explained perfectly how it's wrong:
In the Pandas documentation for read_csv(), it says, "separators longer than 1 character and different from '\s+' will be interpreted as regular expressions." So, no, separators are not always fixed strings. Also, r is a perfectly good prefix for strings used for regular expressions, since it avoids having to escape backslashes.
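A quick illustration of that point, passing a regex straight to sep (a sketch on two of the sample lines; quotechar keeps the quoted journal name in one field):

import io
import pandas as pd

s = ("_journal_issue 23\n"
     "_journal_name_full 'Physical Chemistry and Chemical Physics'\n")
# sep accepts a regex; '\s+' splits on runs of whitespace outside the quotes
df = pd.read_csv(io.StringIO(s), header=None, sep=r'\s+', quotechar="'")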
The dataset looks like this:
region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
As you can see, the last column sometimes has separators in it. This makes it hard to import the CSV file in pandas using read_csv(). The only way I can import the file is by adding the parameter error_bad_lines=False to the function. But this way I'm losing some of the data.
How can I import the CSV file without losing data?
I would read the file as one single column and parse manually:
df = pd.read_csv(filename, sep='\t')  # '\t' never occurs, so each line is read as a single column
pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region', 'state', 'latitude', 'longitude']])
pat = '^' + pat + ',(?P<status>.*)$'
df = df.iloc[:, 0].str.extract(pat)
Output:
region state latitude longitude status
0 florida FL 27.8333 -81.717 open,for,activity
1 georgia GA 32.9866 -83.6487 open
2 hawaii HI 21.1098 -157.5311 illegal,stuff
3 iowa IA 42.0046 -93.214 medical,limited
Have you tried the old-school technique with the split function? A major downside is that you'd end up with misparsed data or errors if any of the first 4 fields/columns contains a comma, but if not, you could use it.
data = open(file, 'r').read().split('\n')
for line in data:
    items = line.split(',', 4)  # assuming there are 4 standard columns and the 5th column has the commas
Each row items would look, for example, like this:
['hawaii', 'HI', '21.1098', '-157.5311', 'illegal,stuff']
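To turn those rows into a frame, one might continue like this (a sketch; states.csv is a hypothetical file name):

import pandas as pd

with open('states.csv') as fh:  # hypothetical file name
    lines = fh.read().splitlines()

header = lines[0].split(',')  # region,state,latitude,longitude,status
rows = [line.split(',', 4) for line in lines[1:] if line]
df = pd.DataFrame(rows, columns=header)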
I have a CSV file which is very messy in terms of column and row alignment. In the first cell, all column names are stated, but they do not align with the rows beneath. So when I load this CSV in Python using pandas,
I do not get a clean dataframe.
Under "Desired output" below is an example of how it should look, with the columns separated and matching the rows.
Some details:
Few lines of raw CSV file:
Columns:
VMName;"Cluster";"time";"AvgValue";"MinValue";"MaxValue";"MetricId";"MemoryMB";"CpuMHz";"NumCpu"
Rows:
ITLT4301;1;"1-5-2018";976439;35059255;53842;6545371441;3235864;95200029;"MemActive";"4096";"0";"0"
Code:
df = pd.read_csv(file_location, sep=";")
Output when loading the dataframe in python:
VMName;"Cluster";"time";"AvgValue";"MinValue";"MaxValue";"MetricId";"MemoryMB";"CpuMHz";"NumCpu",,,
ITLT4301;1;"1-5-2018";976439,35059255 53842,6545371441 3235864,"95200029 MemActive"" 4096"" 0"" 0"""
Desired output:
VMName Cluster time AvgValue MinValue MaxValue MetricId MemoryMB CpuMHz
ITLT4301 1 1-5-201 976439 35059255 53842 6545371441 95200029 MemActive
NumCpu
4096
Hopefully this clears up the topic and problem a bit. The desired output is a well-organized data frame where the columns match the rows based on the separator sign ";".
Your input data file is not a standard csv file. The correct way would be to fix the previous step in order to get a normal csv file, instead of a mess of double quotes that prevents any decent csv parser from correctly extracting the data.
As a workaround, it is possible to remove the initial and terminating double quote, remove any doubled double quote, and split every line on the semicolon, ignoring any remaining double quote. Optionally, you could also try to just remove every double quote and split the lines on ';'. It really depends on what values you expect.
A possible code could be:
import pandas as pd

def split_line(line):
    '''Split a line on ';' after stripping whitespace; the initial and
    terminating '"' and any doubled double quotes are also removed.'''
    return line.strip()[1:-1].replace('""', '').split(';')

with open('file.dat') as fd:
    cols = split_line(next(fd))               # extract column names from the header line
    data = [split_line(line) for line in fd]  # process the data lines

df = pd.DataFrame(data, columns=cols)         # build a dataframe from that
With that input:
"VMName;""Cluster"";""time"";""AvgValue"";""MinValue"";""MaxValue"";""MetricId"";""MemoryMB"";""CpuMHz"";""NumCpu"""
"ITLT4301;1;""1-5-2018"";976439" 35059255;53842 6545371441;3235864 "95200029;""MemActive"";""4096"";""0"";""0"""
"ITLT4301;1;""1-5-2018"";98" 9443749608104;29 3435452286154;673 "067568681366;""CpuUsageMHz"";""0"";""5600"";""2"""
It gives:
VMName Cluster time AvgValue MinValue \
0 ITLT4301 1 1-5-2018 976439" 35059255 53842 6545371441
1 ITLT4301 1 1-5-2018 98" 9443749608104 29 3435452286154
MaxValue MetricId MemoryMB CpuMHz NumCpu
0 3235864 "95200029 MemActive 4096 0 0
1 673 "067568681366 CpuUsageMHz 0 5600 2
Today I am struggling with an interesting warning:
parsers.py:1139: DtypeWarning: Columns (1,4) have mixed types. Specify dtype option on import or set low_memory=False.
Let's start from the beginning, I have several files with thousands of lines each, the content of each file looks like this:
##ID ChrA StartA EndA ChrB StartB EndB CnvType Orientation GeneA StrandA LastExonA TotalExonsA PhaseA GeneB StrandB LastExonB TotalExonsB PhaseB InFrame InPhase
nsv871164 1 8373207 8373207 1 8436802 8436802 DELETION HT ? ? ? ? ? RERE - 14 24 0 Not in Frame
dgv1n68 1 16765770 16765770 1 16936692 16936692 DELETION HT ? ? ? ? ? NBPF1 - 2 29 -1 Not in Frame
nsv9213 1 16777016 16777016 1 16779533 16779533 DELETION HT NECAP2 + 6 8 0 NECAP2 + 6 8 1 In Frame Not in Phase
.....
nsv510572 Y 16898737 16898737 Y 16904738 16904738 DELETION HT NLGN4Y + 4 6 1 NLGN4Y + 3 6 1 In Frame In Phase
nsv10042 Y 59192042 59192042 Y 59196197 59196197 DELETION HT ? ? ? ? ? ? ? ? ? ? ?
Columns [1] and [4] refer to "Human Chromosomes" and are supposed to be 1 to 22, then X and Y.
Some files are short (2k lines) some are very long (200k lines).
If I make a pandas.DataFrame out of a short file, there is no problem: the parser correctly recognizes the items in columns [1] and [4] as 'string'.
But if the file is long enough, the parser assigns 'int' up to a certain point and then 'string' as soon as it encounters 'X' or 'Y'.
At this point I get the warning.
I think that is happening because the parser loads in memory a limited number of rows, then checks the best type to assign considering all the values of a column and then it goes on parsing the rest of the file.
Now, if all the rows can be parsed at once, then there are no mistakes, the parser recognizes all the values at once [1,2,3,4...,'X','Y'] and assign the best type (in this case 'str').
If the number of rows is too big, then the file is parsed in pieces and in my case the first piece contains only [1,2,3,4] and the parser assigns 'int'.
This, of course, is messing up my pipeline.
How can I force the parser to assign ONLY to column[1] and [4] the type 'str'?
This is the code I use to make Dataframes out of my files:
dataset = pandas.io.parsers.read_table(my_file, sep='\t', index_col=0)
You can set the dtypes of the columns as a param to read_csv, so if you know the column names, just pass a dict with the column names as keys and the dtypes as values, for example:
dataset = pandas.io.parsers.read_table(my_file, sep='\t', index_col=0, dtype={'ChrA':'str'})
Just keep adding additional column names to the dict.
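For the two columns from the question, the call might look like this (a sketch using the header names from the sample data; my_file is the question's own variable):

import pandas

# force both chromosome columns to 'str' so 1..22, 'X' and 'Y' parse uniformly
dataset = pandas.io.parsers.read_table(my_file, sep='\t', index_col=0,
                                       dtype={'ChrA': str, 'ChrB': str})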