I have a CSV file with three columns, namely (cid, ccontent, value). I want to loop through each word in the ccontent column and translate the words individually.
I found this code for translating a row, but I want to translate each word, not the whole row.
How to write a function in Python that translates each row of a csv to another language?
from googletrans import Translator
import pandas as pd
headers = ['A','B','A_translation', 'B_translation']
data = pd.read_csv('./data.csv')
translator = Translator()
# Init an empty dataframe with as many rows as `data`
df = pd.DataFrame(index=range(0,len(data)), columns=headers)
def translate_row(row):
    '''Translate elements A and B within `row`.'''
    a = translator.translate(row[0], dest='fr')
    b = translator.translate(row[1], dest='fr')
    return pd.Series([a.origin, b.origin, a.text, b.text], headers)

for i, row in enumerate(data.values):
    # Fill the empty dataframe with the returned series.
    df.loc[i] = translate_row(row)
print(df)
Thank you
You can try something along these lines, using list comprehensions:
def translate_row(row):
    # Split each cell into words and translate them one by one
    row0bywords = [translator.translate(eachword, dest='fr') for eachword in row[0].split()]
    row1bywords = [translator.translate(eachword, dest='fr') for eachword in row[1].split()]
    return row0bywords, row1bywords
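Applied to your three-column file, here is a minimal sketch (assuming the column really is named ccontent, that French is the target language, and that you want the translated words rejoined into one string; each googletrans result exposes the translated string as .text):
from googletrans import Translator
import pandas as pd

translator = Translator()
data = pd.read_csv('./data.csv')  # columns: cid, ccontent, value

def translate_words(text):
    # Translate every word of `text` individually, then rejoin them
    words = str(text).split()
    return ' '.join(translator.translate(w, dest='fr').text for w in words)

data['ccontent_translation'] = data['ccontent'].apply(translate_words)
print(data)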
Related
I have some big Excel files like this (note: other variables are omitted for brevity):
and I need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop Pandas code that, at least, parses the first column and extracts the id and full name of each user. Could you help with this?
The way I would tackle it (and I assume there are likely more efficient ways) is to import the Excel file into a dataframe, then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line to a list. This list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd
# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])
# remove blank rows
df.dropna(inplace=True)
# reset the index of df
df.reset_index(drop=True, inplace=True)
# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []
# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        # this row holds a person: "Name (Position)"
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(')')
        p_index = counter
        counter += 1
    else:
        # this row holds a date and an amount for the current person
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos, 'date': date, 'amount': amount}
        list_of_lines.append(line_dict)
final_df = pd.DataFrame(list_of_lines)
OUTPUT:
I'm trying to retrieve a string from an Excel sheet, split it into words, and then print it or write it back into a new string. But when I retrieve the data using pandas and try to split it, an error occurs saying the DataFrame doesn't support the split function.
The Excel sheet has this line in it:
I expect an output like this:
import numpy
import pandas as pd
df = pd.read_excel('eng.xlsx')
txt = df
x = txt.split()
print(x)
AttributeError: 'DataFrame' object has no attribute 'split'
That's because you are applying the split() function to a whole DataFrame, which is not possible; split() is a string method.
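As a minimal fix, here is a sketch assuming the sentence sits in the first cell and the sheet has no header row: pull that cell out as a string, which does support split():
import pandas as pd

df = pd.read_excel('eng.xlsx', header=None)
# .iloc[0, 0] returns the cell contents as a scalar, not a DataFrame
x = str(df.iloc[0, 0]).split()
print(x)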
import pandas as pd
import numpy as np
def append_nan(x, max_len):
    """
    Append NaN values to a list until it reaches a given max length.
    """
    if len(x) < max_len:
        x += [np.nan] * (max_len - len(x))
    return x

# I define here a dataframe for the example
#df = pd.DataFrame(['This is my first sentence', 'This is a second sentence with more words'])
df = pd.read_excel('your_file.xlsx', header=None)
col_names = df.columns.values.tolist()
df_output = df.copy()
# Split your strings
df_output[col_names[0]] = df[col_names[0]].apply(lambda x: x.split(' '))
# Get the maximum length over all your sentences
max_len = max(map(len, df_output[col_names[0]]))
# Append NaN values so all rows have the same number of columns
df_output[col_names[0]] = df_output[col_names[0]].apply(lambda x: append_nan(x, max_len))
# Create columns names and build your dataframe
column_names = ["word_"+str(d) for d in range(max_len)]
df_output = pd.DataFrame(list(df_output[col_names[0]]), columns=column_names)
# Then you can save it
df_output.to_excel('output.xlsx')
I have some code that reads a table in a Word document and makes a dataframe from it.
import numpy as np
import pandas as pd
from docx import Document
#### Time for some old fashioned user functions ####
def make_dataframe(f_name, table_loc):
    document = Document(f_name)
    tables = document.tables[table_loc]
    data = []
    for i, row in enumerate(tables.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            # first row holds the column names
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
    df = pd.DataFrame.from_dict(data)
    return df
SHRD_filename = "SHRD - 12485.docx"
SHDD_filename = "SHDD - 12485.docx"
df_SHRD = make_dataframe(SHRD_filename,30)
df_SHDD = make_dataframe(SHDD_filename,-60)
The files are different: for instance, the SHRD file has 32 tables and the one I am looking for is the second to last, while the SHDD file has 280 tables and the one I am looking for is 60th from the end. But that may not always be the case.
How do I search through the tables in a document and start working on the one whose cell (0, 0) equals 'Tag Numbers'?
You can iterate through the tables and check the text in the first cell. I have modified the output to return a list of dataframes, just in case more than one table is found. It will return an empty list if no table meets the criteria.
def make_dataframe(f_name, first_cell_string='tag numbers'):
    document = Document(f_name)
    # create a list of all of the table objects whose first-cell
    # text equals `first_cell_string` (case-insensitive)
    tables = [t for t in document.tables
              if t.cell(0, 0).text.lower().strip() == first_cell_string]
    # in the case that more than one table is found
    out = []
    for table in tables:
        data = []
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                keys = tuple(text)
                continue
            row_data = dict(zip(keys, text))
            data.append(row_data)
        out.append(pd.DataFrame.from_dict(data))
    return out
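You would then call it along these lines (a sketch reusing your file names; it assumes at least one matching table per document, so taking the first element would raise an IndexError if no table matches):
df_SHRD = make_dataframe(SHRD_filename, 'tag numbers')[0]
df_SHDD = make_dataframe(SHDD_filename, 'tag numbers')[0]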
# Program to combine data from 2 csv files
The cdc_list gets updated after the second call of read_csv.
overall_list = []

def read_csv(filename):
    file_read = open(filename, "r").read()
    file_split = file_read.split("\n")
    string_list = file_split[1:len(file_split)]
    #final_list = []
    for item in string_list:
        int_fields = []
        string_fields = item.split(",")
        string_fields = [int(x) for x in string_fields]
        int_fields.append(string_fields)
        #final_list.append()
        overall_list.append(int_fields)
    return(overall_list)
cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
print(len(cdc_list)) #3652
total_list = read_csv("US_births_2000-2014_SSA.csv")
print(len(total_list)) #9131
print(len(cdc_list)) #9131
I don't think the code you pasted explains the issue you've had; at least it's not anywhere I can determine. It seems like there's a lot of code you did not include in what you pasted above that might be responsible.
However, if all you want to do is merge two CSVs (assuming they both have the same columns), you can use pandas' read_csv, concat, and to_csv to achieve this with 3 lines of code (not including imports):
import pandas as pd
# Read CSV file into a Pandas DataFrame object
df = pd.read_csv("first.csv")
# Read the 2nd CSV file and concatenate it onto the same DataFrame object
df = pd.concat([df, pd.read_csv("second.csv")], ignore_index=True)
# Write merged DataFrame object (with both CSV's data) to file
df.to_csv("merged.csv")
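Note that to_csv writes the DataFrame index as an extra first column by default; if you want the merged file to contain only the original columns, pass index=False:
df.to_csv("merged.csv", index=False)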
I have two kinds of files, Excel and CSV, which I am using to read data with two permanent columns, Question and Answer, and two temporary columns, Word and Replacement, which may or may not be present.
I have made different functions to read data from the CSV and Excel files, which will be called based on the extension of the file.
Is there a way to read the data from the temporary columns (Word and Replacement) when they are present, and skip them when they are not? Please see the function definitions below:
1) For CSV file:
import string
import pandas

def read_csv_file(path):
    quesData = []
    ansData = []
    asciiIgnoreQues = []
    qWithoutPunctuation = []
    colnames = ['Question', 'Answer']
    data = pandas.read_csv(path, names=colnames)
    quesData = data.Question.tolist()
    ansData = data.Answer.tolist()
    qWithoutPunctuation = quesData
    qWithoutPunctuation = [''.join(c for c in s if c not in string.punctuation) for s in qWithoutPunctuation]
    for x in qWithoutPunctuation:
        asciiIgnoreQues.append(x.encode('ascii', 'ignore'))
    return asciiIgnoreQues, ansData, quesData
2) Function to read excel data:
from xlrd import open_workbook

def read_excel_file(path):
    book = open_workbook(path)
    sheet = book.sheet_by_index(0)
    quesData = []
    ansData = []
    asciiIgnoreQues = []
    qWithoutPunctuation = []
    for row in range(1, sheet.nrows):
        quesData.append(sheet.cell(row, 0).value)
        ansData.append(sheet.cell(row, 1).value)
    qWithoutPunctuation = quesData
    qWithoutPunctuation = [''.join(c for c in s if c not in string.punctuation) for s in qWithoutPunctuation]
    for x in qWithoutPunctuation:
        asciiIgnoreQues.append(x.encode('ascii', 'ignore'))
    return asciiIgnoreQues, ansData, quesData
I'm not entirely sure what you tried to achieve, but reading and transforming your data, the pandas way, is done as follows:
def read_file(path, typ):
    if typ == "excel":
        df = pd.read_excel(path, sheet_name=0)  # Default is zero
    else:  # Assuming "csv". You can make it explicit
        df = pd.read_csv(path)
    qWithoutPunctuation = df["Question"].apply(lambda s: ''.join(c for c in s if c not in string.punctuation))
    df["asciiIgnoreQues"] = qWithoutPunctuation.apply(lambda x: x.encode('ascii', 'ignore'))
    return df

# Call it like this:
read_file("file1.csv", "csv")
read_file("file2.xls", "excel")
read_file("file2.xlsx", "excel")
This will return a DataFrame with the columns ["Question", "Answer", "asciiIgnoreQues"] in case your data didn't include Word and Replacement, and ["Question", "Word", "Replacement", "Answer", "asciiIgnoreQues"] if it did.
Note that I've used apply, which enables you to run a function element-wise on an entire Series.
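If you then need to act on the temporary columns only when they exist, one sketch (assuming the headers are spelled exactly Word and Replacement) is to check df.columns after reading:
df = read_file("file1.csv", "csv")
# Only touch the temporary columns when the file actually contains them
if "Word" in df.columns and "Replacement" in df.columns:
    words = df["Word"].tolist()
    replacements = df["Replacement"].tolist()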