I am using Python 3 and Scrapy to crawl some data. In some cases I have two sentences that I would like to write to a comma-separated CSV file for Excel.
How can I keep the '\r\n' from splitting them onto new lines? Instead, how can I treat the whole sentence as a single string?
The sentences are as below
'USBについての質問です\r\n下記のサイトの通りCentOS7を1USBからインストールしよう...',
'USBからインストールしよう...',
Without seeing a code snippet of how you are parsing the strings, it's a bit difficult to suggest how exactly you can solve your problem. Anyway, you can always use replace to remove the occurrences of \r\n from your string:
>>> string = 'abc\r\ndef'
>>> print(string)
abc
def
>>> string.replace('\r\n', ' ')
'abc def'
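Applied to the CSV-writing step, here is a minimal Python 3 sketch; the output file name and the sample row are placeholders, since the question doesn't show the spider code:

```python
import csv

# Placeholder rows; in practice these would come from the scrapy spider
rows = [
    ['USBについての質問です\r\n下記のサイトの通りCentOS7を1USBからインストールしよう...', 'other field'],
]

with open('out.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for row in rows:
        # Collapse embedded CR/LF so each record stays on one physical line
        writer.writerow([field.replace('\r\n', ' ') for field in row])
```

Note `newline=''` in the `open()` call: the csv module manages line endings itself, and omitting it can produce doubled blank lines on Windows.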
Since you want to write to a CSV file, I'd suggest you use a pandas DataFrame, as it makes life a whole lot easier.
import pandas as pd
string = "'USBについての質問です\r\n下記のサイトの通りCentOS7を1USBからインストールしよう...'\r\n'USBからインストールしよう...'"
# Replace the line breaks, then split into one element per piece
string = string.replace('\r\n', ',')
list1 = string.split(',')
df = pd.DataFrame(list1, columns=None)
df.to_csv('file.csv', header=False, index=False)
Thanks for all the advice and possible solutions provided above. In the end I found a way to solve it:
string = "'USBについての質問です\r\n下記のサイトの通りCentOS7を1USBからインストールしよう...'\r\n'USBからインストールしよう...'"
string = string.replace('\r', '\\r').replace('\n', '\\n')
Writing this string into the CSV then makes the CSV show \r\n together with the other text as one whole string.
So I am trying to create a program to do the following:
Allow a user to manually input some alphanumeric characters, with some regex included - e.g. ^MASDJOEUFJ0.|^WAOIFUWH2IW9.|^abcd130.
Remove all regex characters/delimiters - ,.|^
Print out the new alphanumeric string - e.g. MASDJOEUFJ0 WAOIFUWH2IW9 abcd130
Load the contents of an Excel spreadsheet into memory, for comparison purposes
Compare the alphanumeric string (in step 3) against the contents of the Excel spreadsheet
Print/highlight only the differences
I am new to Python, but using my previous programming experience I have created a program which does up to step 4. I am having issues working out the last two steps; here is what I've done so far:
import re
import pandas as pd
user_input = input("Enter Regex : ")  # avoid shadowing the built-in name str
# Strip the regex delimiters , . | ^
pattern = r"[,.|^]"
user_input = re.sub(pattern, " ", user_input)
#user_input = user_input.split()
print(user_input, "\n", "\n")
df = pd.read_excel (r"C:\Users\...\...\...\Spreadsheet_Comparison.xlsx")
#print (df, "\n", "\n")
I am not sure if I have even used the correct approach so far or not, so any help/guidance here is appreciated.
I am aware that this might not be the most professional way of writing this program but I don't need it to be, I just need something basic that will do the job, and that is easy and straightforward to follow.
Thanks in advance for all the help.
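For the last two steps, one possible approach is to compare the extracted codes against the spreadsheet column as sets. A minimal sketch follows; the column name 'Codes' and the inline sample data are hypothetical stand-ins for the pd.read_excel result, since the spreadsheet layout isn't shown:

```python
import pandas as pd

# Hypothetical stand-in for df = pd.read_excel(r"...\Spreadsheet_Comparison.xlsx")
df = pd.DataFrame({'Codes': ['MASDJOEUFJ0', 'WAOIFUWH2IW9', 'ZZZZ999']})

# Codes produced by step 3, after stripping the regex delimiters
user_codes = 'MASDJOEUFJ0 WAOIFUWH2IW9 abcd130'.split()

sheet_codes = set(df['Codes'])
# Differences in both directions (step 6)
only_in_input = sorted(set(user_codes) - sheet_codes)
only_in_sheet = sorted(sheet_codes - set(user_codes))

print('In input but not in spreadsheet:', only_in_input)
print('In spreadsheet but not in input:', only_in_sheet)
```

Sets make the comparison order-independent and fast; if you also need to know row positions or duplicates, a pandas merge with `indicator=True` would be the next step up.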
While writing strings containing certain special characters, such as
Töölönlahdenkatu
using to_csv from pandas, the result in the csv looks like
T%C3%B6%C3%B6l%C3%B6nlahdenkatu
How do we get to write the text of string as it is? This is my to_csv command
df.to_csv(csv_path,index=False,encoding='utf8')
I have even tried
df.to_csv(csv_path,index=False,encoding='utf-8')
df.to_csv(csv_path,index=False,encoding='utf-8-sig')
and still no success. There are other characters replaced with random symbols, e.g.
'-' to –
Is there a workaround?
What you're trying to do is remove diacritics such as umlauts and tildes. There is an easy solution for that.
import unicodedata
data = u'Töölönlahdenkatu Adiós Pequeño'
english = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
print(english)
output : b'Toolonlahdenkatu Adios Pequeno'
Let me know if it works or if there are any edge cases.
Special characters like ö are stored in a CSV as multi-byte sequences (UTF-8, for instance). The "random symbols" appear when a program like Excel opens the file assuming a different encoding; a UTF-8-aware editor like VS Code will show the characters correctly. Writing with encoding='utf-8-sig' adds a byte-order mark that hints to Excel that the file is UTF-8.
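The %C3%B6 form in the question is URL percent-encoding of the UTF-8 bytes, which to_csv itself does not produce. If that encoding crept into the data upstream (a guess at the cause, not something the question confirms), it can be reversed with urllib.parse.unquote before writing:

```python
from urllib.parse import unquote

encoded = 'T%C3%B6%C3%B6l%C3%B6nlahdenkatu'
decoded = unquote(encoded)  # decodes the %XX escapes as UTF-8 bytes
print(decoded)
```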
This question already has answers here:
How to convert a string to a number if it has commas in it as thousands separators?
(10 answers)
Closed 3 years ago.
My .txt file has a lot of numbers divided into two columns, but they are written the Brazilian way (this means the number 3.41 is written as 3,41). I know how to read each column; I just need to change every comma in the .txt to a dot, but I have no idea how to do that.
There are three ways I thought of to solve the problem:
Changing every comma into a dot and overwriting the previous .txt,
Writing another .txt with another name, but with every comma changed into a dot,
Importing every string (that should be a float) from the .txt and using replace to change the "," into a ".".
If you can help me with one of the first two ways that would be better, especially the first one.
(I have only imported numpy and don't know how to use other libraries yet, so if you could help me with the code I would really appreciate it. Sorry about the bad English, love ya.)
Try this:
with open('input.txt') as input_f, open('output.txt', 'w') as output_f:
    for line in input_f:
        output_f.write(line.replace(',', '.'))
for input.txt:
1,2,3,4,5
10,20,30,40
the output will be:
1.2.3.4.5
10.20.30.40
While your question is tagged python, here's a super-simple non-pythonic way, using the sed command-line utility.
This will replace all commas (,) with dots (.) in your textfile, overwriting the original file:
sed -e 's/,/./g' -i yourtext.txt
Or, if you want the output in a different file:
sed -e 's/,/./g' yourtext.txt > newfile.txt
umlaute's answer is good, but if you insist on doing this in Python you can use fileinput, which supports inplace replacement:
import fileinput
with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
    for line in file:
        # With inplace=True, whatever is printed to stdout replaces the file's contents
        print(line.replace(',', '~').replace('.', ',').replace('~', '.'), end='')
This example assumes you have .'s and ,'s in your example already so uses the tilde as an interim character while fixing both characters. If you have ~'s in your data, feel free to swap that out for another uncommon character.
If you're working with a csv, be careful not to replace your column delimiter character. In this case, you'll probably want to use regex replace instead to ensure that each comma replaced is surrounded by digits: r'\d,\d'
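A sketch of that digit-guarded replacement using lookarounds, so only commas squeezed directly between digits are converted (the sample line is made up; note it can still misfire if the column delimiter itself sits between digits):

```python
import re

line = '3,41 10,5 texto'
# Convert a comma only when it is immediately preceded and followed by a digit
fixed = re.sub(r'(?<=\d),(?=\d)', '.', line)
print(fixed)
```

The lookarounds `(?<=\d)` and `(?=\d)` match positions rather than characters, so the surrounding digits are left in place and don't need to be re-inserted in the replacement.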
I'm trying to import csv with pandas read_csv and can't get the lines containing the following snippet to work:
"","",""BSF" code - Intermittant, see notes",""
I am able to get past it with the options error_bad_lines=False, low_memory=False, engine='c'. However, it should be possible to parse these lines correctly. I'm not good with regular expressions, so I haven't tried engine='python', sep=<regex> yet. Thanks for any help.
Well, that's quite a hard one... given that all fields are quoted, you could use a regex that only treats a , preceded and followed by " as a separator (regex separators require the python engine):
data = pd.read_csv(filename, sep=r'(?<="),(?=")', engine='python')
However, you will still end up with quotes around all fields, but you could fix this by applying
data = data.applymap(lambda s:s[1:-1])
I'm at a total loss of how to do this.
My Question: I want to take this:
"A, two words with comma","B","C word without comma","D"
"E, two words with comma","F","G more stuff","H no commas here!"
... (continue)
To this:
"A, two words with comma",B,C word without comma,D
"E, two words with comma",F,G more stuff,H no commas here!
... (continue)
I used software that created 1,900 records in a text file. I think it was supposed to be a CSV, but whoever wrote the software doesn't know how CSV files work, because a cell only needs quotes if it contains a comma (right?). At least I know that Excel puts everything in the first cell...
I would prefer this to be solvable using some sort of command line tool like perl or python (I'm on a Mac). I don't want to make a whole project in Java or anything to take care of this.
Any help is greatly appreciated!
Shot in the dark here, but I think that Excel is putting everything in the first column because it doesn't know it's being given comma-separated data.
Excel has a "text-to-columns" feature, where you'll be able to split a column by a delimiter (make sure you choose the comma).
There's more info here:
http://support.microsoft.com/kb/214261
edit
You might also try renaming the file from *.txt to *.csv. That will change the way Excel reads the file, so it better understands how to parse whatever it finds inside.
If bash is an option, you can try this one-liner in a terminal, which strips the quotes from every field that does not contain a comma:
sed 's/"\([^,]*\)"/\1/g' file.csv > new-file.csv
That technically should be fine: the text is delimited with " and separated by ,.
I don't see anything wrong with the first file at all; any field may be quoted, only some require it. More than likely the writer of the code didn't want to overcomplicate the logic and quoted everything.
One way to clean it up is to feed the data to the csv module and dump it back out.
import csv
from io import StringIO

bad_data = """\
"A, two words with comma","B","C word without comma","D"
"E, two words with comma","F","G more stuff","H no commas here!"
"""

buffer = StringIO()
writer = csv.writer(buffer)
writer.writerows(csv.reader(bad_data.splitlines()))
buffer.seek(0)
print(buffer.read())
Python's csv.writer will default to the "excel" dialect, so it will not write the commas when not necessary.
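A quick check of that default: the "excel" dialect uses QUOTE_MINIMAL, so only fields that actually contain the delimiter (or quotes/newlines) get quoted:

```python
import csv
from io import StringIO

buf = StringIO()
csv.writer(buf).writerow(['A, with comma', 'B'])
print(buf.getvalue())  # '"A, with comma",B\r\n'
```

Note the '\r\n' terminator: the excel dialect ends rows with CRLF regardless of platform.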