Removing unwanted strings from numeric float pandas string column - python

I have a dataframe that I harvested from Wikipedia that has lat/long coordinates as a column, and I'm trying to remove the occurrences of strings between parens that appear in some of the rows but not all.
Sample:
25 53.74333, 91.38583
47 -10.167, 148.700 (Abau Airport)
155 16.63611, -14.19028
414 49.02528, -122.36000
1 16.01111, 43.17778
176 35.34167, 1.46667 (Abdelhafid Boussouf Bou Ch...)
I've tried doing this
big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].apply(lambda x: float(x.replace('[^\d.]', '')))
Which throws the error below, basically indicating that not all rows have characters to remove. That's fine, but if I try implementing a for loop with a try/except then I'll have to map, and in the case of this dataframe I don't have a unique ID to use as a key.
ValueError: could not convert string to float: '53.58472, 14.90222'
Removing the float cast and doing this:
big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].apply(lambda x: x.replace('[^\d.]', ''))
The code executes, but no changes are made, and I'm not sure why.
The expected output should look like this:
25 53.74333, 91.38583
47 -10.167, 148.700
155 16.63611, -14.19028
414 49.02528, -122.36000
1 16.01111, 43.17778
176 35.34167, 1.46667

This is just a simple regex:
df.Lat_Lon.str.extract('^([-\d\.,\s]+)')
Output:
0
25 53.74333, 91.38583
47 -10.167, 148.700
155 16.63611, -14.19028
414 49.02528, -122.36000
1 16.01111, 43.17778
176 35.34167, 1.46667
You can go as far as extracting both the latitude and longitude:
df.Lat_Lon.str.extract('^(?P<Lat>[-\d\.]+),\s(?P<Lon>[-\d\.]+)')
Output:
Lat Lon
25 53.74333 91.38583
47 -10.167 148.700
155 16.63611 -14.19028
414 49.02528 -122.36000
1 16.01111 43.17778
176 35.34167 1.46667
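Since the goal is ultimately numeric values, the extracted columns can be cast in one step; a small follow-up sketch using the same frame:
coords = df.Lat_Lon.str.extract(r'^(?P<Lat>[-\d.]+),\s(?P<Lon>[-\d.]+)')
coords = coords.astype(float)  # both columns become float64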

Instead of using Python's str.replace, use the pandas Series.replace with the regex=True option. Your line should instead be:
big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].replace(r'[^\d.]', '', regex=True)
Just a heads up, I assumed your regex string was well formed.
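To see why the original attempt silently did nothing, here is a minimal sketch on the question's sample values. The narrower pattern at the end is my assumption about the intent, since [^\d.] would also strip the comma, space, and minus sign:
import pandas as pd

s = pd.Series(['53.74333, 91.38583', '-10.167, 148.700 (Abau Airport)'])

# str.replace treats '[^\d.]' as a literal substring, so nothing matches:
s.apply(lambda x: x.replace('[^\d.]', ''))   # unchanged

# Series.replace with regex=True interprets it as a pattern; removing only
# the parenthesized suffix keeps the coordinates intact:
s.replace(r'\s*\([^)]*\)', '', regex=True)   # '-10.167, 148.700'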

Related

splitting html tag data to multiple columns

I am stuck on an issue: my data is in the format shown in the image below.
[image: Splitting data to multiple column]
Is there any way in a pandas dataframe to segregate this data into multiple columns, for example:
[image: Data in required format]
Can anyone help me out?
I have tried to split it, but it does not work:
df3=df.technische_daten.str.split('\s+(?=\<\/[a-z]+\>)', expand=True)
df3[0]=df3[0].str.replace(r'<li>', '', regex=False)
df3[1]=df3[1].str.replace(r'</li> ','',regex=True)
Data Snippet:
<ul><li>Höhe: 248 mm</li><li>Länge: 297 mm</li><li>Breite: 246 mm</li><li>Gewicht: 4,0 kg</li><li>Leerlaufdrehzahl: 5500 U/min</li><li>Sägeblattdurchmesser: 190 mm</li><li>Leistungsaufnahme: 1400 Watt</li><li>Standard: 821552-6,B-02939,195837-9,164095-8</li><li>Bohrung: 30 mm</li><li>Schnittleistung 45°: 48,5 mm</li><li>Vibration Sägen Holz: 2,5 m/s²</li><li>Schnittleistung 0°: 67 mm</li><li>Sägeblatt-Ø / Bohrung: 190/30 mm</li><li>Max. Schnitttiefe 90°: 67 mm</li><li>Schnittleistung 0°/45°: 67/48,5 mm</li></ul>
Helpfully, pandas has a built-in function, DataFrame.to_html, which will make an HTML table for you.
import pandas as pd
df_rows = []
# put the below in a for loop to get all of your rows
# rows = all_your_data
# for row in rows:
# remove this line and use the above for loop
row = "<ul><li>Höhe: 248 mm</li><li>Länge: 297 mm</li><li>Breite: 246 mm</li><li>Gewicht: 4,0 kg</li><li>Leerlaufdrehzahl: 5500 U/min</li><li>Sägeblattdurchmesser: 190 mm</li><li>Leistungsaufnahme: 1400 Watt</li><li>Standard: 821552-6,B-02939,195837-9,164095-8</li><li>Bohrung: 30 mm</li><li>Schnittleistung 45°: 48,5 mm</li><li>Vibration Sägen Holz: 2,5 m/s²</li><li>Schnittleistung 0°: 67 mm</li><li>Sägeblatt-Ø / Bohrung: 190/30 mm</li><li>Max. Schnitttiefe 90°: 67 mm</li><li>Schnittleistung 0°/45°: 67/48,5 mm</li></ul>"
values = row.split("</li><li>")
# clean the data
values[0] = values[0].replace("<ul><li>", "")
values[-1] = values[-1].replace("</li></ul>", "")
dict_of_values = {}
for value in values:
    dict_of_values[value.split(": ")[0]] = value.split(": ")[1]
df_rows.append(dict_of_values)
# outside of for loop
df = pd.DataFrame.from_dict(df_rows, orient='columns')
# use df.drop to remove any columns you do not need
df = df.drop(['Leerlaufdrehzahl', 'Sägeblattdurchmesser'], axis=1)
your_html = df.to_html()
Hopefully this helps.
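For what it's worth, the same parsing can also be done column-wise with pandas string methods. A hedged sketch, assuming the raw HTML sits in the technische_daten column from the question and that the keys are unique within each row (pivot raises otherwise):
# pull every '<li>key: value</li>' pair into long form, one row per match
pairs = df['technische_daten'].str.extractall(r'<li>(?P<key>[^:<]+):\s*(?P<value>[^<]*)</li>')
# pivot back to wide form: one column per key, one row per original row
wide = pairs.reset_index().pivot(index='level_0', columns='key', values='value')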

python read_csv custom separator

I'm trying to read a csv and split the data into 2 columns. I tried some regex separators like (?<=_.*)\s+, but Python returns "re.error: look-behind requires fixed-width pattern". Other variants like \s+(?![^_\S+]) give more than 2 columns.
Could someone help me find a solution?
pd.read_csv('out.txt', header=None, sep=r"(?<=_.*)\s+", skiprows=2,
            engine='python', keep_default_na=False)
_journal_issue 23
_journal_name_full 'Physical Chemistry and Chemical Physics'
_journal_page_first 9197
_journal_page_last 9204
_journal_paper_doi 10.1039/c3cp50853f
_journal_volume 15
_journal_year 2013
_chemical_compound_source 'Corrosion product'
_chemical_formula_structural 'Fe3 O4'
_chemical_formula_sum 'Fe3 O4'
_chemical_name_mineral Magnetite
_chemical_name_systematic 'Iron diiron(III) oxide'
_space_group_crystal_system cubic
_space_group_IT_number 227
_space_group_name_Hall 'F 4d 2 3 -1d'
_space_group_name_H-M_alt 'F d -3 m :1'
_cell_angle_alpha 90
_cell_angle_beta 90
_cell_angle_gamma 90
_cell_formula_units_Z 8
_cell_length_a 8.36
_cell_length_b 8.36
_cell_length_c 8.36
_raman_determination_method experimental
_[local]_chemical_compound_color Black
_[local]_chemical_compound_state Solid
_raman_measurement_device.location 'IMMM Maine university'
_raman_measurement_device.company 'HORIBA Jobin Yvon'
_raman_measurement_device.model T64000
_raman_measurement_device.optics_type microscope
_raman_measurement_device.microscope_system dispersive
_raman_measurement_device.microscope_objective_magnification 100
_raman_measurement_device.microscope_numerical_aperture 0.90
_raman_measurement_device.excitation_laser_type Argon-Krypton
_raman_measurement_device.excitation_laser_wavelength 514
_raman_measurement_device.configuration simple
_raman_measurement_device.resolution 3
_raman_measurement_device.power_on_sample 2
_raman_measurement_device.direction_polarization unoriented
_raman_measurement_device.spot_size 0.8
_raman_measurement_device.diffraction_grating 600
_raman_measurement.environment air
_raman_measurement.environment_details
_raman_measurement.temperature 300
_raman_measurement.pressure 100
_raman_measurement.background_subtraction no
_raman_measurement.background_subtraction_details
_raman_measurement.baseline_correction no
_raman_measurement.baseline_correction_details
Try
df = pd.read_csv(
    "out.txt", header=None, delim_whitespace=True, quotechar="'", keep_default_na=False
)
Result for your sample:
0 1
0 _journal_issue 23
1 _journal_name_full Physical Chemistry and Chemical Physics
2 _journal_page_first 9197
...
46 _raman_measurement.background_subtraction_details
47 _raman_measurement.baseline_correction no
48 _raman_measurement.baseline_correction_details
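If you'd rather keep the quotes exactly as they appear in the file, an alternative sketch (plain Python, no variable-width look-behind) reads whole lines and splits each one once on the first run of whitespace:
import pandas as pd

with open('out.txt') as f:
    rows = [line.rstrip('\n').split(None, 1) for line in f]

# keys with no value (e.g. the *_details lines) become one-field rows; pad them
df = pd.DataFrame([r + [''] * (2 - len(r)) for r in rows])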
As per the pandas documentation for pd.read_csv, you can provide sep only as a string, and a string like r"<String>" is usually used for a raw string.
What I would recommend is to first loop through the file and replace all delimiters with a common delimiter, then feed the file to pandas read_csv.
Apparently, the above answer is not true. jjramsey in the comment below has explained perfectly how it's wrong:
"In the pandas documentation for read_csv(), it says, 'separators longer than 1 character and different from \s+ will be interpreted as regular expressions.' So, no, separators are not always fixed strings. Also, r is a perfectly good prefix for strings used for regular expressions, since it avoids having to escape backslashes."

Am I using groupby.sum() correctly?

I have the following code, with a problem in the new_df["SUM"] line:
import pandas as pd
df = pd.read_excel(r"D:\Tesina\Proteoma Humano\Tablas\uno - copia.xlsx")
#df = pd.DataFrame({'ID': ['C9JLR9','O95391', 'P05114',"P14866"], 'SEQ': ['1..100,182..250,329..417,490..583', '1..100,206..254,493..586', '1..100', "1..100,284..378" ]})
df2 = pd.DataFrame
df["SEQ"] = df["SEQ"].replace("\.\."," ", regex =True)
new_df = df.assign(SEQ=df.SEQ.str.split(',')).explode('SEQ')
for index, row in df.iterrows():
    new_df['delta'] = new_df['SEQ'].map(lambda x: (int(x.split()[1])+1)-int(x.split()[0]) if x.split()[0] != '1' else (int(x.split()[1])+1))
    new_df["SUM"] = new_df.groupby(["ID"]).sum().reset_index(drop=True)  # Here's the error, even though I can't see where
    df2 = new_df.groupby(["ID","SUM"], sort=False)["SEQ"].apply(lambda x: ','.join(x.astype(str))).reset_index(name="SEQ")
To give some context, here is what it does: it grabs every line with the same ID, separates the numbers with a "," in between, does some math with those numbers (that's where the "delta" line gets involved; I know it's not really a delta), and finally sums up all the "delta" values for each ID, grouping them by their original ID so that I keep the same number of rows.
And when I use a sample of the data (the one that's commented out at the beginning), it works perfectly, giving me the output that I want:
ID SUM SEQ
0 C9JLR9 353 1 100,182 250,329 417,490 583
1 O95391 244 1 100,206 254,493 586
2 P05114 101 1 100
3 P14866 196 1 100,284 378
But when I apply it to my Excel file (which has 10471 rows), the groupby.sum() line doesn't work as it's supposed to (I've already checked everything else; I know the error is within that line).
This is the output that I receive:
ID SUM SEQ
0 C9JLR9 39 1 100,182 250,329 417,490 583
1 O95391 20 1 100,206 254,493 586
2 P05114 33 1 100
4 P98177 21 1 100,176 246
You can clearly see that the SUM values differ (and are not correct at all). I haven't been able to figure out where those numbers come from, either. It's really weird.
If anyone is interested, the solution was provided in the comments: I had to change the line to the following:
new_df["SUM"] = new_df.groupby("ID")["delta"].transform("sum")

Split a pandas dataframe header into multiple columns

I'm trying to split the dataframe header id;signin_count;status into more columns where I can put my data. I've tried df.columns.values, but I couldn't get a string to use .split on, as I was hoping. Instead, I got:
Index(['id;signin_count;status'], dtype='object')
Which returns AttributeError: 'Index' object has no attribute 'split' when I try .split
In broader terms, I have:
id;signin_count;status
0 353;20;done;
1 374;94;pending;
2 377;4;done;
And want:
id signin_count status
0 353 20 done
1 374 94 pending
2 377 4 done
Splitting the data itself is not the problem here; that I can do. The focus is on how to access the header names without hardcoding them, as I will have to do the same with any other dataset in the same format.
From the get-go, thank you
If you are reading your data from a csv file, you can set sep to ';' and read it as:
df=pd.read_csv('filename.csv', sep=';', index_col=False)
Output:
id signin_count status
0 353 20 done
1 374 94 pending
2 377 4 done
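If the frame has already been loaded as a single semicolon-packed column, a sketch that avoids hardcoding the header names (the rstrip handles the trailing ';' visible in the sample rows):
# take the new column names from the packed header itself
cols = df.columns[0].split(';')

out = df.iloc[:, 0].str.rstrip(';').str.split(';', expand=True)
out.columns = cols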

How to find multiple substrings between <> in one column in pandas data frame + python

I am using Pandas and Python. My data is:
a = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                  'Str': ['aa <aafae><afre> ht4',
                          'v fef <><433>',
                          '<1234334> <a>',
                          '<bijf> 04<9tu0>q4g <vie>',
                          'aaa 1']})
I want to extract all the substrings between < > and merge them with a blank in between. For example, the above example should give this result:
aafae afre
433
1234334 a
bijf 9tu0 vie
nan
So all the substrings between < > are extracted, and the result is nan if there are no such strings. I have already tried the re library and str functions, but I am really new to regex. Could anyone help me out here?
Use pandas.Series.str.findall:
a['Str'].str.findall('<(.*?)>').apply(' '.join)
Output:
0 aafae afre
1 433
2 1234334 a
3 bijf 9tu0 vie
4
Name: Str, dtype: object
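Note that the expected output shows nan for rows with no <...> groups, while findall plus join leaves an empty string there; one hedged follow-up to get NaN instead:
import numpy as np

a['Str'].str.findall('<(.*?)>').apply(' '.join).replace('', np.nan)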
Maybe this expression might also work, somewhat and to some extent:
import pandas as pd

a = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                  'Str': ['aa <aafae><afre> ht4',
                          'v fef <><433>',
                          '<1234334> <a>',
                          '<bijf> 04<9tu0>q4g <vie>',
                          'aaa 1']})
a["new_str"] = a["Str"].str.replace(r'.*?<([^>]+)>|(?:.+)', r'\1 ', regex=True)
print(a)
