Replacing a Character in .csv file only for specific strings - python

I am trying to clean a file and have removed the majority of unnecessary data, except for this one remaining issue. The file I am cleaning is made up of rows containing numbers; see the example rows below.
[Example of data][1] [1]: https://i.stack.imgur.com/0bADX.png
You can see that I have cleaned the data so that there is a space between each character aside from the four characters that start each row. There are some character groupings where I have not yet added a space between each character, because in those groupings I need to replace the "1"s with a space rather than keep them.
[Strings I still need to clean][2] [2]: https://i.stack.imgur.com/gmeUs.png
I have tried the following two methods in order to replace the 1's in these specific strings, but both produce results that I do not want.
Method 1 - Replacing 1's before splitting characters into their own columns
Data2 = pd.read_csv('filename.csv')
Data2['Column']=Data2['Column'].apply(lambda x: x.replace('1',' ') if len(x)>4 else x)
This method replaces every 1 in the entire file, not just the 1's in strings like those pictured above (formatted like "8181818"). I would have thought that the if statement would exclude the removal of the 1's where fewer than 4 characters are grouped together.
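The likely culprit is that len(x) measures the whole cell string, not each space-separated group, so the condition holds for nearly every row. A sketch of applying the check per group instead (clean_cell is a hypothetical helper, assuming groups within a cell are space-separated):

```python
# Hypothetical helper (not from the original post): split each cell on the
# existing spaces, and only replace '1's inside groups longer than 4 chars.
def clean_cell(cell):
    groups = cell.split(' ')
    cleaned = [g.replace('1', ' ') if len(g) > 4 else g for g in groups]
    return ' '.join(cleaned)

print(clean_cell('2 0 1 8181818'))  # 2 0 1 8 8 8 8
```

This could then be applied with Data2['Column'].apply(clean_cell), leaving short groups untouched.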
Method 2 - Replacing 1's after splitting characters into their own columns
Since Method 1 was resulting in the removal of each 1 in the file, I figured I could split each string into its own column (essentially using the spaces as a delimiter) and then try a similar method to clean these unnecessary 1's by focusing on the specific columns where these strings are located (columns 89, 951, and 961).
for col in (89, 951, 961):
    Data2[col] = Data2[col].apply(lambda x: x.replace('1', ' ') if len(x) != 1 else x)
    Data2[col] = pd.DataFrame(Data2[col].str.split(' ').tolist())
This method successfully removed only the 1's in these strings, but when I then split the numbers I am keeping into their own columns, they overwrite the existing values in those columns rather than pushing the existing values into the columns further down the line.
Any assistance on either of these methods or advice on if there is a different approach I should be taking would be much appreciated.


How to perform str.strip in dataframe and save it with inplace=true?

I have a dataframe with n columns, and I want to strip the strings in one of those columns. I was able to do it, but I want the change to be reflected in the original dataframe.
Dataframe: data
Name
0 210123278414410005
1 101232784144610006
2 210123278414410007
3 21012-27841-410008
4 210123278414410009
After stripping:
Name
0 10005
1 10006
2 10007
3 10008
4 10009
I tried the below code and strip was successful
data['Name'].str.strip().str[13:]
However if I check dataframe, the strip is not reflected.
I am looking for something like inplace parameter.
String methods (the attributes of the .str attribute on a Series) only ever return a new Series; you can't use them for in-place changes. Your only option is to assign the result back to the same column:
data['Name'] = data['Name'].str.strip().str[13:]
You could instead use the Series.replace() method with a regular expression, and inplace=True:
data['Name'].replace(r'(?s)\A\s*.{,13}(.*?)(?<!\s)\s*\Z', r'\1', regex=True, inplace=True)
The regular expression above skips up to 13 characters after any leading whitespace and captures everything that remains, ignoring trailing whitespace. It produces the same output as .str.strip().str[13:], but makes the changes in place.
The pattern uses a negative look-behind to make sure that the final \s* pattern only matches the whitespace at the very end of the string, so trailing whitespace is never included in the captured group. The \A and \Z anchors make it so the whole string is matched.
Put differently, the \A\s* and (?<!\s)\s*\Z patterns act like str.strip() does, matching all whitespace at the start and end, respectively, and no more. The .{,13}(.*?) pattern matches everything in between, discarding the first 13 characters (or fewer, if there are not enough characters after stripping) and capturing the rest as a group. That group is then used as the replacement value.
And because . doesn't normally match \n characters, the (?s) flag at the start tells the regex engine to match newline characters anyway; this way an input value like ' foo\nbar ' is handled correctly. We want all characters to be included after stripping, not just those on the first line.
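A quick sanity check of the assignment approach on made-up sample values, alongside a regex variant that keeps everything after the first 13 stripped characters (a minimal sketch, not the asker's real data):

```python
import pandas as pd

# Sample values modeled on the question's data (made up).
names = pd.Series([' 210123278414410005 ', '101232784144610006'], name='Name')

# Assigning the result back is the straightforward fix.
stripped = names.str.strip().str[13:]

# A regex alternative that strips and drops the first 13 characters.
via_regex = names.replace(r'(?s)\A\s*.{,13}(.*?)(?<!\s)\s*\Z', r'\1', regex=True)

print(stripped.tolist())   # ['10005', '10006']
print(via_regex.tolist())  # ['10005', '10006']
```

Both produce the trailing five digits; the regex form has the advantage that it can be run with inplace=True.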
data['Name'].str.strip().str[13:] returns the new transformed column, but it does not change the data inside the dataframe in place. You should write:
data['Name'] = data['Name'].str.strip().str[13:]
to write the transformed data to the Name column.
I agree with the other answers that there's no inplace parameter for the strip function, as seen in the documentation for str.strip.
To add to that: I've mostly seen the str functions for pandas Series used when selecting specific rows, like df[df['Name'].str.contains('69')]. I'd say that's a possible reason it doesn't have an inplace parameter -- it's not meant to be completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume the strings are 18 characters long; we consistently get the "last 5 characters" instead!
As per yatu's comment: you should reassign the Series with stripped values to the original column.
data['Name'] = data['Name'].str.strip().str[13:]
It is interesting to note that pandas DataFrames are backed by NumPy arrays underneath.
There is also the option of performing the operation on the NumPy array directly.
Here is the example I had in mind:
import numpy as np
import pandas as pd
df=pd.DataFrame([['210123278414410005', '101232784144610006']])
arr = df.to_numpy(copy=False)  # the underlying NumPy array
df = pd.DataFrame(np.frompyfunc(lambda s: s[13:], 1, 1)(arr))
print(df)  # 10005 10006
This doesn't answer your question directly, but it is just another option (it does create a new dataframe from the NumPy array, though).

Python: Find a character to retrieve an index of string to replace with another character

I know string slicing and indexing is fairly straightforward, but I can't seem to make my code work here. Sorry, I am a newbie and just learning!
I am trying to check whether each item in a list (called "lines") contains a certain string. The strings are pulled from another list (called "suffixes"), and I want to return an index so I can replace the first character, a whitespace, with a dash "-".
However, the str.find method returns -1 in most cases, meaning the string is not found, except in one case where it returns 43 when the first string in "suffixes" is found in an item in "lines".
Example output:
Acephate Butachlor Cycloate Dimethoate (Sum) -1
Aldicarb Captan (Sum) Cyprodinil Disulfoton -1
Aldicarb (Sum) Carbaryl Cyromazine Disulfoton (Sum) -1
Amitraz Carboxine DDT (Sum) Dodemorph -1
Azamethiphos Chlorantraniliprole Deltamethrin Endosulfan (A+B+Sulf) -1
Azinphos-ethyl Chlordane Demeton Endosulfan Alfa 43
Azinphos-methyl Chlordane Trans Demeton-S-methyl-sulfone Endosulfan Beta -1
I suspect it is only searching for the first suffix, but I have followed the syntax I found in multiple places, so I can't see why.
lines = ['', 'Abamectin Buprofezin Cyazofamid Dimethoate', '', 'Acephate Butachlor Cycloate Dimethoate (Sum)', '', 'Acequinocyl Butocarboxim Cycloxydim Dimethomorph', '', 'Acetamiprid Butralin Cyflufenamid Diniconazole', '', 'Acetochlor Cadusafos Cyfluthrin Dinocap', '', 'Acrinathrin Captafol Cymoxanil Dinotefuran', '', 'Alachlor Captan Cyproconazole Diphenylamine', '']
"""if there are any suffixes, then join to the preceeding word with a dash, so then can split data by spaces"""
suffixes=[" Alfa"," Beta"," Sulfate"," sulfoxide"," (DCPA)"," (Sum)"," (Folpet)"," sulphone"," butoxide"," Methyl"," (A+D)"," (THPI)"," (A+B+Sulf)"]
for line in lines:
    if any(suffix in line for suffix in suffixes):
        print(line, line.find(suffix))
        ind = line.find(suffix)
        line[ind].replace(' ', '-')
Once I have joined some of the words with their suffixes using a "-", I will split the rest of the items in "lines" into new items, splitting by whitespace.
The issue I am facing: if any of the strings in "suffixes" (note, each has whitespace at the start) is found as a substring of an item in the list "lines", I want the index to be returned. This is not happening currently; instead the output shows only one case where the first string in "suffixes" is found, and then the loop finishes.
If I add the line:
if index != -1:
    print(line, line.find(suffix))
Then my expected output would be something like:
Acephate Butachlor Cycloate Dimethoate (Sum) 38
Azamethiphos Chlorantraniliprole Deltamethrin Endosulfan (A+B+Sulf) 56
etc....
Edit: although my problem has been solved another way, I would like to understand why my code is not returning the index as I want.
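For the record, the loop misbehaves because the suffix variable inside any(...) is local to that generator expression; the later line.find(suffix) call sees whatever suffix happened to be bound earlier (or raises NameError in a fresh run), hence the stray -1s and the single 43. On top of that, strings are immutable, so line[ind].replace(' ', '-') discards its result. A minimal corrected sketch using two of the question's sample lines:

```python
lines = ['Acephate Butachlor Cycloate Dimethoate (Sum)',
         'Azinphos-ethyl Chlordane Demeton Endosulfan Alfa']
suffixes = [' Alfa', ' Beta', ' (Sum)']

found = []
for line in lines:
    for suffix in suffixes:          # bind suffix in a real loop
        ind = line.find(suffix)
        if ind != -1:
            found.append((line, ind))
            # strings are immutable: build a new string rather than
            # calling .replace() on a single character and discarding it
            line = line[:ind] + '-' + line[ind + 1:]

print([ind for _, ind in found])  # [38, 43]
```

The indices 38 and 43 match the expected output in the question.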
There's no need for indexing, you can just try the replacement. If the suffix isn't present, then it just won't be replaced.
for suffix in suffixes:
    lines = [line.replace(suffix, suffix.replace(" ", "-")) for line in lines]
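A quick demonstration of that approach on two of the sample lines, using a few of the suffixes from the question (a minimal sketch):

```python
lines = ['Acephate Butachlor Cycloate Dimethoate (Sum)',
         'Azinphos-ethyl Chlordane Demeton Endosulfan Alfa']
suffixes = [' Alfa', ' Beta', ' (Sum)', ' (A+B+Sulf)']

# For each suffix, replace its leading space with a dash wherever it occurs.
for suffix in suffixes:
    lines = [line.replace(suffix, suffix.replace(' ', '-')) for line in lines]

print(lines)
# ['Acephate Butachlor Cycloate Dimethoate-(Sum)',
#  'Azinphos-ethyl Chlordane Demeton Endosulfan-Alfa']
```

After this pass, splitting each line on whitespace keeps every name-plus-suffix pair together as one token.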
You may also be having problems with the case. You have "Ethyl" in your list of suffixes, but "ethyl" in the output.

How To Remove \x95 chars from text - Pandas?

I am having trouble removing a space at the beginning of strings in pandas dataframe cells. Looking at the dataframe, it seems like there is a space at the start of each string; however, printing one of the affected cells gives "\x95 12345", so as you can see it is not a normal space char but rather "\x95".
I already tried to use strip() - But it didn't help.
The dataframe was produced using a str.split(pat=',').tolist() expression, which split the strings into different cells by ',', and now my strings have this char added.
Assuming col1 is your first column name:
import re
df.col1 = df.col1.apply(lambda x: re.sub(r'\x95',"",x))
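An equivalent vectorized form uses pandas' own string methods and also strips the leftover space (a sketch on a made-up frame; the col1 name is assumed):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['\x95 12345', '67890']})

# Remove the \x95 bullet character, then strip any remaining whitespace.
df['col1'] = df['col1'].str.replace('\x95', '', regex=False).str.strip()

print(df['col1'].tolist())  # ['12345', '67890']
```

regex=False makes the replacement a plain substring match, which is all that's needed here.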

Python: Replace the first nth matching letter to another letter

This is related to a CSV-trimming process.
I have a malformed CSV file that has 4 columns, but the last column contains too many (and an unknown number of) commas.
I want to replace the delimiter to another character such as "|"
For example, string = "a,b,c,d,e,f" into "a|b|c|d,e,f"
The following code works, but I would like to find a better, more efficient way to process a large text file.
sample_txt='a,b,c,d,e,f'
temp = sample_txt.split(",")
output_txt = '|'.join(temp[0:3]) + '|' + ','.join(temp[3:])
Python has the perfect way to do this, with str.replace:
>>> sample_txt='a,b,c,d,e,f'
>>> print(sample_txt.replace(',', '|', 3))
a|b|c|d,e,f
str.replace takes an optional third argument which dictates the maximum number of replacements to perform.
sample_txt='a,b,c,d,e,f'
output_txt = sample_txt.replace(',', '|', 3)

replace string in pandas dataframe

I have a dataframe with multiple columns. I want to look at one column and if any of the strings in the column contain #, I want to replace them with another string. How would I go about doing this?
A dataframe in pandas is composed of columns, which are Series (see the pandas docs on data structures).
I'm going to use regex, because it's useful and everyone needs practice, myself included! (See the pandas docs on working with text data.)
Note the str.replace. The regex you want is '.*#+.*' (it worked for me), which says: any character (.) zero or more times (*), followed by a # one or more times (+), followed by any character (.) zero or more times (*).
df['column'] = df['column'].str.replace('.*#+.*', 'replacement', regex=True)
Should work, where 'replacement' is whatever string you want to put in (note that newer pandas versions require regex=True for str.replace to treat the pattern as a regex).
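For example, on a small made-up frame (column name and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'column': ['abc#def', 'no hash here', '#']})

# Any cell containing at least one '#' matches the pattern and is replaced.
df['column'] = df['column'].str.replace('.*#+.*', 'replacement', regex=True)

print(df['column'].tolist())  # ['replacement', 'no hash here', 'replacement']
```

Cells without a # are left untouched because the pattern requires at least one.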
My suggestion:
df['col'] = ['new string' if '#' in x else x for x in df['col']]
not sure which is faster.
Assuming you called your dataframe df, you can do:
pd.DataFrame([['anotherString' if '#' in x else x for x in df[col]] for col in df.columns]).transpose()
