python read_csv custom separator - python

I'm trying to read a CSV-like file and split each line into 2 columns. I tried regex separators such as (?<=_.*)\s+, but Python raises "re.error: look-behind requires fixed-width pattern". Other variants, such as \s+(?![^_\S+]), give more than 2 columns.
Could someone help me find a solution?
pd.read_csv('out.txt', header=None, sep=r"(?<=_.*)\s+", skiprows=2,
engine='python', keep_default_na=False)
_journal_issue 23
_journal_name_full 'Physical Chemistry and Chemical Physics'
_journal_page_first 9197
_journal_page_last 9204
_journal_paper_doi 10.1039/c3cp50853f
_journal_volume 15
_journal_year 2013
_chemical_compound_source 'Corrosion product'
_chemical_formula_structural 'Fe3 O4'
_chemical_formula_sum 'Fe3 O4'
_chemical_name_mineral Magnetite
_chemical_name_systematic 'Iron diiron(III) oxide'
_space_group_crystal_system cubic
_space_group_IT_number 227
_space_group_name_Hall 'F 4d 2 3 -1d'
_space_group_name_H-M_alt 'F d -3 m :1'
_cell_angle_alpha 90
_cell_angle_beta 90
_cell_angle_gamma 90
_cell_formula_units_Z 8
_cell_length_a 8.36
_cell_length_b 8.36
_cell_length_c 8.36
_raman_determination_method experimental
_[local]_chemical_compound_color Black
_[local]_chemical_compound_state Solid
_raman_measurement_device.location 'IMMM Maine university'
_raman_measurement_device.company 'HORIBA Jobin Yvon'
_raman_measurement_device.model T64000
_raman_measurement_device.optics_type microscope
_raman_measurement_device.microscope_system dispersive
_raman_measurement_device.microscope_objective_magnification 100
_raman_measurement_device.microscope_numerical_aperture 0.90
_raman_measurement_device.excitation_laser_type Argon-Krypton
_raman_measurement_device.excitation_laser_wavelength 514
_raman_measurement_device.configuration simple
_raman_measurement_device.resolution 3
_raman_measurement_device.power_on_sample 2
_raman_measurement_device.direction_polarization unoriented
_raman_measurement_device.spot_size 0.8
_raman_measurement_device.diffraction_grating 600
_raman_measurement.environment air
_raman_measurement.environment_details
_raman_measurement.temperature 300
_raman_measurement.pressure 100
_raman_measurement.background_subtraction no
_raman_measurement.background_subtraction_details
_raman_measurement.baseline_correction no
_raman_measurement.baseline_correction_details

Try
df = pd.read_csv(
"out.txt", header=None, delim_whitespace=True, quotechar="'", keep_default_na=False
)
Result for your sample:
0 1
0 _journal_issue 23
1 _journal_name_full Physical Chemistry and Chemical Physics
2 _journal_page_first 9197
...
46 _raman_measurement.background_subtraction_details
47 _raman_measurement.baseline_correction no
48 _raman_measurement.baseline_correction_details

According to the pandas documentation for pd.read_csv, sep can only be a plain string, and a string like r"<String>" is just a raw string.
What I would recommend is to first loop through the file, replace all delimiters with one common delimiter, and then feed the file to pandas read_csv.
Edit: the first claim above is not true. jjramsey explained why in a comment:
In the Pandas documentation for read_csv(), it says, "separators longer than 1 character and different from \s+ will be interpreted as regular expressions." So no, separators are not always fixed strings. Also, r is a perfectly good prefix for strings used for regular expressions, since it avoids having to escape backslashes.
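The "loop through the file first" idea can also be done without any separator regex. The following is a minimal sketch, assuming every line is a key optionally followed by one value; the column names key and value are placeholders chosen for illustration, and only a few lines of the sample are reproduced.

```python
import io
import pandas as pd

# A few lines copied from the question's sample; the last key has no value.
sample = """_journal_issue 23
_journal_name_full 'Physical Chemistry and Chemical Physics'
_cell_length_a 8.36
_raman_measurement.environment_details
"""

rows = []
for line in io.StringIO(sample):
    parts = line.strip().split(None, 1)   # split at the first whitespace run only
    if not parts:
        continue                          # skip blank lines
    key = parts[0]
    # Drop the surrounding single quotes, if any; empty string when no value.
    value = parts[1].strip("'") if len(parts) > 1 else ""
    rows.append((key, value))

df = pd.DataFrame(rows, columns=["key", "value"])
print(df)
```

Because the split happens only once per line, spaces inside a quoted value never create extra columns.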

Related

pandas split string with $ special text style

I have an Excel file in which each cell contains two $ signs. When I read it using pandas, the text comes through in an odd style.
import pandas as pd
df = pd.DataFrame({ 'Bid-Ask':['$185.25 - $186.10','$10.85 - $11.10','$14.70 - $15.10']})
after pd.read_excel
df['Bid'] = df['Bid-Ask'].str.split('−').str[0]
The code above doesn't work: the $ signs seem to turn my string into specially styled text, so the split function has no effect.
my expected result
Do not split. Using str.extract is likely the most robust:
df[['Bid', 'Ask']] = df['Bid-Ask'].str.extract(r'(\d+(?:\.\d+)?)\D*(\d+(?:\.\d+)?)')
Output:
Bid-Ask Bid Ask
0 $185.25 - $186.10 185.25 186.10
1 $10.85 - $11.10 10.85 11.10
2 $14.70 - $15.10 14.70 15.10
There is a non-breaking space (\xa0) in your string. That's why the split doesn't work.
I copied the strings (of your df) one by one into an Excel file and then imported it with pd.read_excel.
The column looks like this then:
repr(df['Bid-Ask'])
'0 $185.25\xa0- $186.10\n1 $10.85\xa0- $11.10\n2 $14.70\xa0- $15.10\nName: Bid-Ask, dtype: object'
Before splitting you can replace that and it'll work.
df['Bid-Ask'] = df['Bid-Ask'].astype('str').str.replace('\xa0',' ', regex=False)
df[['Bid', 'Ask']] = df['Bid-Ask'].str.replace('$','', regex=False).str.split('-',expand = True)
print(df)
Bid-Ask Bid Ask
0 $185.25 - $186.10 185.25 186.10
1 $10.85 - $11.10 10.85 11.10
2 $14.70 - $15.10 14.70 15.10
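A slightly more defensive variant of the same idea is sketched below. It assumes the cells may contain a non-breaking space (\xa0) and possibly a Unicode minus sign (U+2212) instead of an ASCII hyphen; the second assumption is a guess based on the question's split call, and the sample strings are constructed for illustration.

```python
import pandas as pd

# Hypothetical cells: the first uses \xa0 and U+2212, the second plain ASCII.
df = pd.DataFrame({"Bid-Ask": ["$185.25\xa0\u2212 $186.10", "$10.85 - $11.10"]})

cleaned = (
    df["Bid-Ask"]
    .str.replace("\xa0", " ", regex=False)    # non-breaking space -> space
    .str.replace("\u2212", "-", regex=False)  # Unicode minus -> ASCII hyphen
    .str.replace("$", "", regex=False)        # drop the currency sign
)

parts = cleaned.str.split("-", expand=True)
df["Bid"] = parts[0].str.strip().astype(float)
df["Ask"] = parts[1].str.strip().astype(float)
print(df[["Bid", "Ask"]])
```

Converting to float at the end makes the columns usable for arithmetic, which the string results of split alone are not.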
You can combine apply with a lambda function to split each value in two and slice off the leading $:
df['Bid'] = df['Bid-Ask'].apply(lambda x: x.split('-')[0].strip()[1:])
df['Ask'] = df['Bid-Ask'].apply(lambda x: x.split('-')[1].strip()[1:])
output:
Bid-Ask Bid Ask
0 185.25− 186.10 185.25 186.1
1 10.85− 11.10 10.85 11.1
2 14.70− 15.10 14.70 15.1

Pandas read_csv on file without space?

I have a set of data that looks like the following, where each line is 10 characters long. The lines are links of a network, made up of combinations of 4- or 5-character node numbers. Below is an example of the situations I face:
|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|
The dataset doesn't care much about spacing. While lines 1, 2 and 4 can be read in Pandas easily with either sep=' ' or delim_whitespace=True, I'm afraid I can't do the same for line 3. There is very little I can do to the input data file, as it's generated by third-party software
(apart from doing some formatting in Excel, which seems counterintuitive...). Is there something in Pandas that allows me to specify a number of characters (in my case, 5) as a delimiter?
Advice much appreciated.
I think what you're looking for is pd.read_fwf to read a fixed-width file. In this case, you would specify column specifications:
import io
import pandas as pd
pd.read_fwf(io.StringIO('''|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|'''), colspecs=[(1, 6), (6, 11)], header=None)
The column specifications are 0-indexed and end-exclusive. You could also use the widths parameter, but I would avoid using it before stripping the | out to ensure that your variables are properly read in as numbers rather than as strings starting or ending with a pipe.
This would, in this case, produce:
0 1
0 10637 4652
1 1038 1037
2 70612 19637
3 82004 2082
I passed header=None due to the lack of a header in your sample data. You may need to adjust as needed. I also stripped out all the blank lines in your input. If there are in fact blank lines in the input, then I would first run: '\n'.join((s for s in input_string.split('\n') if len(s.strip()) != 0)) before passing it to be parsed. There, you would also need to first load the file as a string, clean it, and then pass it with io.StringIO to read_fwf.
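The clean-then-parse steps described above might look like the following sketch. The string raw stands in for the file contents (with blank lines added to exercise the cleaning step); reading it from disk is left out.

```python
import io
import pandas as pd

# Stand-in for the file contents, including blank lines between records.
raw = "|10637 4652|\n\n| 1038 1037|\n|7061219637|\n\n|82004 2082|\n"

# Drop empty or whitespace-only lines before parsing.
cleaned = "\n".join(s for s in raw.split("\n") if s.strip())

# Columns are 0-indexed and end-exclusive; position 0 and 11 hold the pipes.
df = pd.read_fwf(io.StringIO(cleaned), colspecs=[(1, 6), (6, 11)], header=None)
print(df)
```

Skipping the pipe positions in colspecs is what lets both columns be inferred as integers.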
With read_csv, you can specify the sep as a group of 4 or 5 digits, then keep only the columns with the numbers.
import pandas as pd
from io import StringIO
s = '''
|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|
'''
print(
pd.read_csv(StringIO(s), sep=r'(\d{4,5})',
engine='python', usecols=[1,3],
index_col=False, header=None)
)
1 3
0 10637 4652
1 1038 1037
2 70612 19637
3 82004 2082
Or you can just load the data and take advantage of the textwrap module: specify the width and it will generate the columns for you.
import textwrap
df['<col_name>'].apply(textwrap.wrap, width = 5).apply(pd.Series)
OUTPUT:
0 1
0 10637 4652
1 1038 1037
2 70612 19637
3 82004 2082
I would just use df['my_col'].str[0:6] after reading it in as one column.
If your file is just this data then @ifly6's use of read_fwf is more appropriate. I just assumed that this is one column as part of a larger CSV, in which case this is the approach I would use.
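As a rough sketch of the slicing approach, assuming the 10-character link already sits in a single string column with the pipes removed (the column names link, from_node and to_node are made up for illustration):

```python
import pandas as pd

# Hypothetical column holding the raw 10-character links, pipes already removed.
df = pd.DataFrame({"link": ["10637 4652", " 1038 1037", "7061219637", "82004 2082"]})

# First five characters are one node, the next five the other.
df["from_node"] = df["link"].str[:5].str.strip().astype(int)
df["to_node"] = df["link"].str[5:10].str.strip().astype(int)
print(df)
```

If the pipes are still present in the column, the slices would need to shift by one character.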

Removing unwanted strings from numeric float pandas string column

I have a dataframe that I harvested from Wikipedia that has lat/long coordinates as a column, and I'm trying to remove the parenthesized strings that occur in some of the rows but not all.
Sample:
25 53.74333, 91.38583
47 -10.167, 148.700 (Abau Airport)
155 16.63611, -14.19028
414 49.02528, -122.36000
1 16.01111, 43.17778
176 35.34167, 1.46667 (Abdelhafid Boussouf Bou Ch...)
I've tried doing this
big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].apply(lambda x: float(x.replace('[^\d.]', '')))
This throws the error below, which I take to mean that not every row has characters to remove. That would be fine, but if I use a for loop with try/except instead, I would have to map the results back, and this dataframe has no unique ID to use as a key.
ValueError: could not convert string to float: '53.58472, 14.90222'
Removing the float cast and doing this:
big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].apply(lambda x: x.replace('[^\d.]', ''))
The code executes, but no changes are made, and I'm not sure why.
The expected output should look like this:
25 53.74333, 91.38583
47 -10.167, 148.700
155 16.63611, -14.19028
414 49.02528, -122.36000
1 16.01111, 43.17778
176 35.34167, 1.46667
This is just a simple regex:
df.Lat_Lon.str.extract(r'^([-\d\.,\s]+)')
Output:
0
25 53.74333, 91.38583
47 -10.167, 148.700
155 16.63611, -14.19028
414 49.02528, -122.36000
1 16.01111, 43.17778
176 35.34167, 1.46667
You can go as far as extracting both the latitude and longitude:
df.Lat_Lon.str.extract(r'^(?P<Lat>[-\d\.]+),\s(?P<Lon>[-\d\.]+)')
Output:
Lat Lon
25 53.74333 91.38583
47 -10.167 148.700
155 16.63611 -14.19028
414 49.02528 -122.36000
1 16.01111 43.17778
176 35.34167 1.46667
Python's str.replace treats its first argument as a literal string, not a regex, which is why your second attempt changes nothing. Use the pandas replace method with the regex=True option instead. Your line would become:
big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].replace(r'[^\d.]', '', regex=True)
Just a heads up: I assumed your regex string was well formed.
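Since the pattern [^\d.] would also remove the commas, spaces and minus signs, a narrower pattern may be closer to the expected output. The sketch below assumes the unwanted text is always a trailing parenthesized note; the series is rebuilt from three of the sample rows.

```python
import pandas as pd

s = pd.Series(
    [
        "53.74333, 91.38583",
        "-10.167, 148.700 (Abau Airport)",
        "35.34167, 1.46667 (Abdelhafid Boussouf Bou Ch...)",
    ],
    name="Lat_Lon",
)

# Remove an optional trailing "(...)" note together with surrounding whitespace.
cleaned = s.str.replace(r"\s*\(.*\)\s*$", "", regex=True)

# Optionally split into two float columns.
coords = cleaned.str.split(",", expand=True).astype(float)
coords.columns = ["Lat", "Lon"]
print(cleaned)
print(coords)
```

Rows without a parenthesized note pass through unchanged, so no try/except is needed.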

How to find multiple substrings between <> in one column in pandas data frame + python

I am using Pandas and Python. My data is:
a=pd.DataFrame({'ID':[1,2,3,4,5],
'Str':['aa <aafae><afre> ht4',
'v fef <><433>',
'<1234334> <a>',
'<bijf> 04<9tu0>q4g <vie>',
'aaa 1']})
I want to extract all the substrings between < > and join them with a space. For example, the data above should give the result:
aafae afre
433
1234334 a
bijf 9tu0 vie
nan
So all the substrings between < > are extracted, and the result is nan if there are none. I have already tried the re library and the str functions, but I am really new to regex. Could anyone help me out here?
Use pandas.Series.str.findall:
a['Str'].str.findall('<(.*?)>').apply(' '.join)
Output:
0 aafae afre
1 433
2 1234334 a
3 bijf 9tu0 vie
4
Name: Str, dtype: object
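The result above leaves an empty string where the question asks for nan, and an empty <> contributes an empty match. A possible adjustment, sketched here with a hypothetical column name extracted, is to match only non-empty contents and convert empty results to NaN:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                  'Str': ['aa <aafae><afre> ht4',
                          'v fef <><433>',
                          '<1234334> <a>',
                          '<bijf> 04<9tu0>q4g <vie>',
                          'aaa 1']})

# [^<>]+ requires at least one character, so an empty <> is skipped.
matches = a['Str'].str.findall(r'<([^<>]+)>')

# Join each list with a space; rows with no match become NaN.
a['extracted'] = matches.str.join(' ').replace('', np.nan)
print(a)
```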
This expression might also work, at least to some extent:
import pandas as pd
a=pd.DataFrame({'ID':[1,2,3,4,5],
'Str':['aa <aafae><afre> ht4',
'v fef <><433>',
'<1234334> <a>',
'<bijf> 04<9tu0>q4g <vie>',
'aaa 1']})
a["new_str"]=a["Str"].str.replace(r'.*?<([^>]+)>|(?:.+)', r'\1 ',regex=True)
print(a)

How do I create a pandas dataframe in python from a csv with additional delimiters?

I have a large csv (on the order of 400k rows) which I wish to turn into a dataframe in python. The original file has two columns: a text column, followed by an int (or NAN) column.
Example:
...
P-X1-6030-07-A01 368963
P-X1-6030-08-A01 368964
P-X1-6030-09-A01 368965
P-A-1-1011-14-G-01 368967
P-A-1-1014-01-G-05 368968
P-A-1-1017-02-D-01 368969
...
I wish to additionally split the text column into a series of columns following the pattern of the last three lines of the example text (P A 1 1017 02 D 01 368969, for example)
Noting that the text field can have varying formatting (P-X1 vs P-X-1), how might this best be accomplished?
First Attempt
The spec for read_csv says that sep can be a regular expression, and with engine='python' it can: separators longer than one character (other than \s+) are interpreted as regular expressions. A character class such as `[- ]+` therefore splits on runs of hyphens or spaces.
Import necessary libs to recreate:
import pandas as pd
from io import StringIO
A character class works as a delimiter within each layout, but the two layouts produce different numbers of fields (P-X1 vs P-X-1), so parsing the mismatched rows together isn't possible this way. You can still parse them separately:
pd.read_csv(StringIO('''P-X1-6030-07-A01 368963
P-X1-6030-08-A01 368964
P-X1-6030-09-A01 368965'''), sep=r'[- ]+', engine='python', header=None) # split on runs of '-' or ' '
and
pd.read_csv(StringIO('''P-A-1-1011-14-G-01 368967
P-A-1-1014-01-G-05 368968
P-A-1-1017-02-D-01 368969'''), sep=r'[- ]+', engine='python', header=None)
But this separator leaves X1 and A01 unsplit, so the columns of the two layouts do not line up.
Final Solution
That means we'll need a custom solution:
import re
from io import StringIO
import pandas as pd
txt = '''P-X1-6030-07-A01 368963
P-X1-6030-08-A01 368964
P-X1-6030-09-A01 368965
P-A-1-1011-14-G-01 368967
P-A-1-1014-01-G-05 368968
P-A-1-1017-02-D-01 368969'''
fileobj = StringIO(txt)
def df_from_file(fileobj):
    '''
    takes a file object, returns DataFrame with columns grouped by
    contiguous runs of either letters or numbers (but not both together)
    '''
    # unfortunately, we must materialize the data before putting it in the DataFrame
    records = [re.findall(r'(\d+|[A-Z]+)', line) for line in fileobj]
    return pd.DataFrame.from_records(records)
df = df_from_file(fileobj)
and now df returns:
0 1 2 3 4 5 6 7
0 P X 1 6030 07 A 01 368963
1 P X 1 6030 08 A 01 368964
2 P X 1 6030 09 A 01 368965
3 P A 1 1011 14 G 01 368967
4 P A 1 1014 01 G 05 368968
5 P A 1 1017 02 D 01 368969
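The question mentions that the second column may be NAN, which the tokenizer above would treat as just another run of letters. One way to handle that is sketched below. It assumes every text field yields exactly seven tokens; the NAN row and the column names part0..part6 and value are hypothetical.

```python
import re
import pandas as pd

lines = [
    "P-X1-6030-07-A01 368963",
    "P-A-1-1011-14-G-01 368967",
    "P-A-1-1017-02-D-01 NAN",   # hypothetical row with a missing number
]

records = []
for line in lines:
    # The last whitespace-separated field is the number; the rest is the label.
    label, value = line.rsplit(None, 1)
    records.append(re.findall(r"\d+|[A-Z]+", label) + [value])

# Assumes seven label tokens per row; a different count would raise here.
columns = [f"part{i}" for i in range(7)] + ["value"]
df = pd.DataFrame(records, columns=columns)

# Non-numeric entries such as "NAN" become NaN.
df["value"] = pd.to_numeric(df["value"], errors="coerce")
print(df)
```

Splitting off the number before tokenizing keeps the label parts as strings (so 07 keeps its leading zero) while the last column becomes numeric.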
