I am building a neural machine translation system from German to Arabic. I am reading a CSV file containing German sentences and their corresponding Arabic translations, and I want to read both languages at the same time using pd.read_csv. I have tried all the encodings listed in the Python codecs documentation, but none of them worked.
The only thing that worked for me is this:
df = pd.read_csv("DLATS.csv", encoding='windows-1256')
'windows-1256' is the encoding alias for Arabic. The problem is that it doesn't handle German special characters like ä; it turns them into question marks (?), so the word drängte became dr?ngte.
So, can anyone please help me solve this problem or suggest a workaround? I have thought of splitting the German and Arabic sentences into separate CSV files so that each file contains one column only, and then perhaps recombining them in the Python code. But it seems that pd.read_csv requires at least two columns in the CSV file to work.
Update: I have noticed that the original CSV file contains these problems for German as well. I finally solved the problem by reading the Excel file directly instead of the CSV, since the original file is in Excel: I used pd.read_excel without any encoding argument and it worked well. I didn't know before that pandas had pd.read_excel.
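For reference, a minimal round-trip sketch (hypothetical tiny dataset and filename) showing that an explicit UTF-8 encoding preserves both the German umlauts and the Arabic script in a CSV, which is likely why the Excel route also worked:

```python
import pandas as pd

# Tiny sample with both a German umlaut and Arabic script.
df = pd.DataFrame({"german": ["drängte"], "arabic": ["حث"]})

# Write and read back with the same explicit encoding.
df.to_csv("sample.csv", index=False, encoding="utf-8")
back = pd.read_csv("sample.csv", encoding="utf-8")

print(back.loc[0, "german"])  # drängte, not dr?ngte
```

If the original CSV was exported with some other encoding, re-saving it as UTF-8 (for example via Excel's "CSV UTF-8" save option) should make plain pd.read_csv work as well.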
In my case, plain read_csv works:
import pandas as pd
df = pd.read_csv('download.csv')
print(df)
german arabic
0 drängte حث
If you get bad results, it is possible that the data was not saved to the CSV correctly in the first place.
I have a .csv file that has the format below:
I am using pandas to read it with utf-8 encoding, but it looks like pandas isn't splitting the "Sentence" and "Label" columns properly.
I am then vectorizing it using CountVectorizer and turning it into an array before doing the train-test split for machine-learning modeling.
However, I am getting an error that just says 'Sentence' when I try to fit_transform.
I think the error comes up because there is no column 'Sentence'; pandas is still viewing the header as a single column "Sentence,Label". Does anyone know if it's a pandas issue or an encoding issue?
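One way to check what pandas actually parsed (a minimal sketch with hypothetical in-memory data) is to print df.columns; if the header shows up as a single 'Sentence,Label' entry, the delimiter is the usual suspect, and an explicit sep fixes it:

```python
import io
import pandas as pd

# The expected two-column layout parses fine with the defaults.
good = io.StringIO("Sentence,Label\nI love this,positive\n")
df = pd.read_csv(good)
print(list(df.columns))  # ['Sentence', 'Label']

# A semicolon-delimited file, by contrast, collapses into one
# column unless the separator is given explicitly.
bad = io.StringIO("Sentence;Label\nI love this;positive\n")
df2 = pd.read_csv(bad, sep=";")
print(list(df2.columns))  # ['Sentence', 'Label']
```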
Is there a way to prevent data exported from Python from being converted into scientific notation in Excel?
ID
1E1
2E9
3E4
After exporting in CSV format I am getting:
ID
1.00E+01
2.00E+09
3.00E+04
I found a similar thread, but none of the answers had a clear explanation and some links were broken.
This is not an issue with Python writing the wrong value to the CSV file. If you open the CSV in a text editor, you will see the value is written in the correct format. If that is not the case, please provide your code and sample data.
Assuming it is written correctly to the CSV by Python, look into converting the values in Excel from scientific notation to text or a number format.
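On the Python side, one workaround (sketched here with hypothetical in-memory data) is to keep the ID column as strings with dtype=str, so pandas itself never turns '1E1' into a float; whether Excel then re-interprets the value on open is a separate Excel formatting question:

```python
import io
import pandas as pd

raw = io.StringIO("ID\n1E1\n2E9\n3E4\n")

# With the defaults, pandas already parses these as floats...
as_float = pd.read_csv(raw)
print(as_float["ID"].tolist())  # [10.0, 2000000000.0, 30000.0]

# ...so force the column to stay text end to end.
raw.seek(0)
as_text = pd.read_csv(raw, dtype=str)
print(as_text["ID"].tolist())  # ['1E1', '2E9', '3E4']
```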
I'm doing sentiment analysis using Python (I'm still a rookie with this particular language). I have some Twitter data in a CSV file that I need to pre-process before doing the real analysis. First of all, I need to tokenize the text from a specific column, in my case the second one (column B). I found some suggestions for how to do the tokenization, but not for how to pick a specific column. Does anyone have experience with this?
I tried this code, which prints all columns, but how can I restrict it to the second one?
import csv
import nltk
from nltk import word_tokenize

with open('TwitterData.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
Any suggestions for modules and code that work for pre-processing for sentiment analysis?
Thanks a lot!
I can highly recommend the scikit-learn documentation and modules, especially the part about "Working with Text Data": https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
There they also have a section about sentiment analysis: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#exercise-2-sentiment-analysis-on-movie-reviews
If you need more specific help with your code, it is always best to provide a "minimal reproducible example": https://stackoverflow.com/help/minimal-reproducible-example
This way, others can help you better with a specific issue you are facing.
I hope that helps :)
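To pick out just the second column with the DictReader from the question, here is a minimal sketch (hypothetical in-memory data with the column named "text"; str.split stands in for nltk.word_tokenize, which would need the punkt data downloaded first):

```python
import csv
import io

# Hypothetical data standing in for TwitterData.csv; the second
# column here is named "text".
data = io.StringIO("id,text\n1,Great day today\n2,Not so great\n")

tokens = []
reader = csv.DictReader(data)
for row in reader:
    # Pick the second column by name...
    tweet = row["text"]
    # ...or positionally, if the header name is unknown:
    # tweet = list(row.values())[1]
    tokens.append(tweet.split())  # swap in nltk.word_tokenize(tweet)

print(tokens)  # [['Great', 'day', 'today'], ['Not', 'so', 'great']]
```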
I am new to Python/pandas. I am wondering whether there is code that can fix how columns shift to the right inside the .csv files we pull out of our systems. One column is filled with user input (containing messy characters such as " and ,), so after loading, the user-input column usually spreads out over several columns instead of one, wrongly shifting the other columns to the right as well.
I fix this manually in Excel, filtering, deleting, and moving the columns back to their correct place; it takes 20 minutes a day.
I would like to ask whether there is code I could try in order to clean up and rearrange the columns correctly, or whether the manual fix in Excel is easier. Thank you!
pandas is altering the columns because it sees extra 'separators' in the import file.
In Excel, count how many commas appear on each line; using your example above, there should be 3 per line.
My quick and dirty solution would be to replace those three delimiting commas on each line with a character that is almost impossible for a user to type; I typically go for the pipe '|' character.
Try importing that into pandas, specifying the new delimiter/separator, for example:
import pandas as pd
df = pd.read_csv(filepath, sep="|")
df.head()
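A small sketch of this with hypothetical in-memory data; once the delimiter is the pipe, commas typed by users no longer split the comment column:

```python
import io
import pandas as pd

# Three pipe-delimited columns; the user's comma survives intact.
data = io.StringIO("id|comment|date\n1|hello, world|2024-01-01\n")
df = pd.read_csv(data, sep="|")

print(df.shape)              # (1, 3)
print(df.loc[0, "comment"])  # hello, world
```

If the system producing the export can instead wrap the user-input field in double quotes, pandas' default quotechar handles embedded commas without changing the delimiter at all.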
You cannot control layout with CSV; it is a pure data-transport format. Fortunately, there are third-party libraries that can work with .xlsx files, here and here.
I'm collecting all the comments from some Facebook pages using Python and Facebook-SDK.
Since I want to do sentiment analysis on these comments, what's the best way to save the texts so that no changes to them are needed later?
I'm now saving the comments as a table and then as a CSV file.
table.to_csv('file-name.csv')
But if I want to read this saved file, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position ...
By the way, I'm working with German text.
Have you tried this? Set a default encoding at the top of your code (note that reload(sys) and sys.setdefaultencoding exist only in Python 2):
import sys
reload(sys)
sys.setdefaultencoding("ISO-8859-1")
or pass the encoding to pandas directly:
pd.read_csv('file-name.csv', encoding="ISO-8859-1")
Byte 0xfc is 'ü' in ISO-8859-1, which fits German text, so this encoding is a plausible match for your file.
If you know the encoding of the data, you can simply tell pandas when reading the CSV:
import pandas as pd
pd.read_csv('filename.csv', encoding='encoding')
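As a sketch of why the traceback points at byte 0xfc: that byte is 'ü' in ISO-8859-1, so a Latin-1 file read as UTF-8 fails exactly there. Writing and reading with the same explicit encoding avoids the mismatch (hypothetical filename and data):

```python
import pandas as pd

comments = pd.DataFrame({"comment": ["Schönes Wetter, natürlich!"]})

# Pin the encoding on both sides of the round trip; the platform
# default can differ, which is how mismatches sneak in.
comments.to_csv("comments.csv", index=False, encoding="utf-8")
back = pd.read_csv("comments.csv", encoding="utf-8")

# 0xfc is 'ü' in Latin-1, the byte the UnicodeDecodeError names.
print(bytes([0xFC]).decode("ISO-8859-1"))  # ü
```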
I would say it really depends on many different factors such as:
Size of the data
What kind of analysis, specifically, are you anticipating that you'll be doing
What format you are most comfortable working with
For most of my data munging in Python I like to use pandas when possible, but sometimes that's not feasible given the size of the data; in that case you'd have to look at something like PySpark. Here is a link to the pandas docs for reference; they have a lot of functionality for reading in all kinds of data: pandas docs