Replacing part of string in python pandas dataframe - python

I have a similar problem to the one posted here:
Pandas DataFrame: remove unwanted parts from strings in a column
I need to remove newline characters from within a string in a DataFrame. Basically, I've accessed an API using Python's json module and that's all OK. Creating the DataFrame works amazingly, too. However, when I want to finally output the end result into a CSV, I get a bit stuck, because there are newlines that are creating false 'new rows' in the CSV file.
So basically I'm trying to turn this:
'...this is a paragraph.
And this is another paragraph...'
into this:
'...this is a paragraph. And this is another paragraph...'
I don't care about preserving any kind of '\n' or any special symbols for the paragraph break. So it can be stripped right out.
I've tried a few variations:
misc['product_desc'] = misc['product_desc'].strip('\n')
AttributeError: 'Series' object has no attribute 'strip'
here's another
misc['product_desc'] = misc['product_desc'].str.strip('\n')
TypeError: wrapper() takes exactly 1 argument (2 given)
misc['product_desc'] = misc['product_desc'].map(lambda x: x.strip('\n'))
misc['product_desc'] = misc['product_desc'].map(lambda x: x.strip('\n\t'))
There is no error message, but the newline characters don't go away, either. Same thing with this:
misc = misc.replace('\n', '')
The write to csv line is this:
misc_id.to_csv('C:\Users\jlalonde\Desktop\misc_w_id.csv', sep=' ', na_rep='', index=False, encoding='utf-8')
Version of Pandas is 0.9.1
Thanks! :)

strip only removes the specified characters at the beginning and end of the string. If you want to remove all \n, you need to use replace.
misc['product_desc'] = misc['product_desc'].str.replace('\n', '')
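The same distinction holds for plain Python strings, which is a quick way to sanity-check the behaviour before applying it to a column:

```python
s = '...this is a paragraph.\nAnd this is another paragraph...\n'

# strip('\n') only trims newlines at the ends of the string;
# the interior newline survives
assert s.strip('\n') == '...this is a paragraph.\nAnd this is another paragraph...'

# replace('\n', ' ') substitutes every newline, including interior ones
assert s.replace('\n', ' ') == '...this is a paragraph. And this is another paragraph... '
```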

You could use the regex parameter of the replace method to achieve that:
misc['product_desc'] = misc['product_desc'].replace(to_replace='\n', value='', regex=True)

Related

How to stop to_csv from using comma as separator

I have a dataframe in which one of the columns contains list object like [1,2].
I am trying to export to csv with the following line
df.to_csv('df.csv', sep = ';')
However, the resultant csv, instead of having each row in a single cell, splits the row at the comma inside the list object, so when opened in a spreadsheet I have something like

Column A        Column B
0;xxx;xxx;[1    2];xxx;xxx;xx
Can someone help? Thanks!
What I want is
Column A
0;xxx;xxx;[1,2];xxx;xxx;xx
Updates:
I have tried to make the column filled with strings like
"[1,2,3]" or "100,000,000", it would still split at the comma.
You’ll probably need to surround the list objects with double quotes to make them strings. Then you can use something like this to transform each string back into a list:
import ast
ast.literal_eval(list_as_string)
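A minimal round-trip sketch of that approach (the cell value here is illustrative):

```python
import ast

cell = "[1, 2]"                    # the list as it comes back out of the CSV
restored = ast.literal_eval(cell)  # safely evaluate the literal back into a list

assert restored == [1, 2]
assert isinstance(restored, list)
```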
This problem can be solved by using quotes:
import csv
df.to_csv('df.csv', sep = ',', quoting=csv.QUOTE_ALL)
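The effect of QUOTE_ALL can be seen with the standard library's csv module alone; once every field is quoted, the comma inside the list no longer acts as a separator (the row values here are made up):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow([0, 'xxx', '[1,2]'])

# every field is wrapped in double quotes, so the comma inside the
# list stays inside one field
assert buf.getvalue().strip() == '"0","xxx","[1,2]"'
```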

Pandas read csv skips some lines

Following up on an old question of mine, I finally identified what happens.
I have a csv file which uses the separator \t, and I am reading it with the following command:
df = pd.read_csv(r'C:\..\file.csv', sep='\t', encoding='unicode_escape')
The resulting length, for example, is 800,000.
The problem is the original file has around 1,400,000 lines, and I also know where the issue occurs: one column (let's say columnA) has the following entry:
"HILFE FüR DIE Alten
Do you have any idea what is happening? When I delete that row I get the correct number of lines (length). What is Python doing here?
According to pandas documentation https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
It may be an issue with the double-quote symbol.
Try this instead:
df = pd.read_csv(r'C:\..\file.csv', sep='\\t', encoding='unicode_escape', engine='python')
or this:
df = pd.read_csv(r'C:\..\file.csv', sep=r'\t', encoding='unicode_escape')
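What happens can be reproduced with the standard csv module: the unbalanced double quote opens a quoted field that swallows everything up to the next quote (or end of file), including line breaks, so whole rows disappear from the count. Disabling quote handling is one way around it; the same quoting argument can also be passed to pd.read_csv. The sample data below is made up:

```python
import csv
import io

data = 'id\tcolumnA\n1\t"HILFE FüR DIE Alten\n2\tok\n'

# default quote handling: the stray " starts a quoted field that
# absorbs the following line, so only one data row comes back
rows = list(csv.reader(io.StringIO(data), delimiter='\t'))
assert len(rows) == 2  # header plus a single merged data row

# with quoting disabled, every physical line is parsed as its own row
rows = list(csv.reader(io.StringIO(data), delimiter='\t', quoting=csv.QUOTE_NONE))
assert len(rows) == 3
```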

Cannot replace special characters in a Python pandas dataframe

I'm working with Python 3.5 on Windows. I have a dataframe where a 'titles' str column contains headline titles, some of which have special characters such as â, €, ˜.
I am trying to replace these with an empty string '' using pandas.replace. I have tried various iterations and nothing works. I am able to replace regular characters, but these special characters just don't seem to work.
The code runs without error, but the replacement simply does not occur, and instead the original title is returned. Below is what I have tried already. Any advice would be much appreciated.
df['clean_title'] = df['titles'].replace('€','',regex=True)
df['clean_titles'] = df['titles'].replace('€','')
df['clean_titles'] = df['titles'].str.replace('€','')
def clean_text(row):
    return re.sub('€', '', str(row))
    # also tried: return str(row).replace('€', '')
df['clean_title'] = df['titles'].apply(clean_text)
We can only assume that you refer to non-ASCII as 'special' characters.
To remove all non-ASCII characters in a pandas dataframe column, do the following:
df['clean_titles'] = df['titles'].str.replace(r'[^\x00-\x7f]', '')
Note that this is a scalable solution, as it works for any non-ASCII character.
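The same character-class pattern works on a plain string with the re module, which makes it easy to test in isolation (the sample headline is invented):

```python
import re

title = 'Markets raâ€˜lly on newsâ€¦'

# drop every character outside the ASCII range \x00-\x7f
clean = re.sub(r'[^\x00-\x7f]', '', title)
assert clean == 'Markets rally on news'
```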
How to remove escape sequence character in dataframe
Data.
product,rating
pest,<br> test
mouse,/ mousetest
Solution (Scala code):
val finaldf = df.withColumn("rating", regexp_replace(col("rating"), "\\\\", "/")).show()

Python writing into csv without line break \r\n

I am using Python 3 and Scrapy to crawl some data. In some instances, I have two sentences which I would like to write to a comma-separated CSV file for Excel.
How can I keep them from being split onto new lines at the '\r\n'? Instead, how can I treat each whole sentence as one string?
The sentences are as below
'USBについての質問です\r\n下記のサイトの通りCentOS7を1USBからインストールしよう...',
'USBからインストールしよう...',
Without seeing a code snippet of how you are parsing the strings, it's a bit difficult to suggest how exactly you can solve your problem. Anyway, you can always use replace to remove the occurrences of \r\n from your string:
>>> string = 'abc\r\ndef'
>>> print(string)
abc
def
>>> string.replace('\r\n', ' ')
'abc def'
Since you want to write to a CSV file, I'd suggest you use a pandas DataFrame, as it makes life a whole lot easier.
import pandas as pd
string = "'USBについての質問です\r\n下記のサイトの通りCentOS7を1USBからインストールしよう...'\r\n'USBからインストールしよう...'"
sentences = string.split('\r\n')  # one sentence per list element
df = pd.DataFrame(sentences, columns=None)
df.to_csv('file.csv', header=False)
Thanks for all the advice and possible solutions provided above. Yet, I have found a way to solve it.
string = "'USBについての質問です\r\n下記のサイトの通りCentOS7を1USBからインストールしよう...'\r\n'USBからインストールしよう...'"
string = string.replace('\r', '\\r').replace('\n', '\\n')
Writing this string into the csv then makes the csv show \r\n together with the other text as one whole string.
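A short sketch of that escaping step (the text is shortened here): the real carriage return and line feed are turned into the two-character sequences \r and \n, so nothing can break the CSV row.

```python
s = 'USBについての質問です\r\n下記のサイト...'
escaped = s.replace('\r', '\\r').replace('\n', '\\n')

# no real line breaks remain, so the cell stays on one line
assert '\r' not in escaped and '\n' not in escaped
assert escaped == 'USBについての質問です\\r\\n下記のサイト...'
```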

How to read csv lines with pandas containing " and ' between quoting character "?

I'm trying to import csv with pandas read_csv and can't get the lines containing the following snippet to work:
"","",""BSF" code - Intermittant, see notes",""
I am able to get past it with the options error_bad_lines=False, low_memory=False, engine='c'. However, it should be possible to parse these lines correctly. I'm not good with regular expressions, so I didn't try using engine='python', sep=regex yet. Thanks for any help.
Well, that's quite a hard one... given that all fields are quoted, you could use a regex that only treats a , preceded and followed by " as a separator:
data = pd.read_csv(filename,sep=r'(?<="),(?=")',quotechar='"')
However, you will still end up with quotes around all fields, but you could fix this by applying
data = data.applymap(lambda s:s[1:-1])
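The lookaround separator can be checked on the problem line with re.split before handing it to read_csv:

```python
import re

line = '"","",""BSF" code - Intermittant, see notes",""'
fields = re.split(r'(?<="),(?=")', line)

# only commas sitting between two quotes act as separators, so the
# comma inside the notes field is left alone
assert len(fields) == 4
assert fields[2] == '""BSF" code - Intermittant, see notes"'

# dropping the surrounding quotes, as the applymap step above does
stripped = [f[1:-1] for f in fields]
assert stripped[2] == '"BSF" code - Intermittant, see notes'
```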
