Apply on Dataframe passes first row values to all rows - python

When using apply in the below way, the values that get passed as "row" are exclusively those from the first row of the dataframe.
df.apply(make_word_file, axis=1)
Oddly, the file name created in the document.save() is correct. newname has the correct values in row['case_name']. However if I print(row) it prints the values from the first row.
def make_word_file(row):
    for key, value in mapfields.items():
        # print(row)
        regex1 = re.compile(key)
        replace1 = str(row[value])
        docx_replace_regex(document, regex1, replace1)
    newname = remove(row['case_name'], '\/:*?"<>|,.')
    print(newname)
    document.save(datadir + row["datename"] + "_" + row["court"] + "_" + newname + ".docx")
I expected print(row) to print the values from each row in the dataframe not just the 1st.
EDIT for clarity:
This script is a mail merge which makes .docx word files.
mapfields is a dict in the format regex: column name. document is a python-docx Document object.
mapfields = {
    "VARfname": "First Name",
    "VARlname": "Last Name",
}

This turned out to be a loop/python-docx issue, not a pandas one.
The document object was loaded once and then modified in place on every call, so after the first row the placeholders had already been replaced and the regexes had nothing left to find. Loading the document template inside the function fixed the issue.
def make_word_file(case_row):
    document_template = Document(directory + fname)
    document = document_template
    for key, value in mapfields.items():
        regex1 = re.compile(key)
        replace1 = str(case_row[value])
        docx_replace_regex(document, regex1, replace1)
    document.save(location + ".docx")
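The effect can be reproduced without python-docx. This is only a minimal sketch: `FakeDocument` is a hypothetical stand-in for a python-docx document, `fill` is a hypothetical replacement for `docx_replace_regex`, and the sample rows are made up. It shows why a document shared across calls keeps the first row's values, while a fresh copy per call does not.

```python
import copy
import re


class FakeDocument:
    """Hypothetical stand-in for a python-docx document: a list of paragraph strings."""

    def __init__(self, paragraphs):
        self.paragraphs = list(paragraphs)


def fill(document, mapfields, row):
    """Replace every placeholder pattern in place (hypothetical docx_replace_regex)."""
    for pattern, column in mapfields.items():
        regex = re.compile(pattern)
        document.paragraphs = [regex.sub(str(row[column]), p) for p in document.paragraphs]
    return document


mapfields = {"VARfname": "First Name", "VARlname": "Last Name"}
rows = [
    {"First Name": "Ada", "Last Name": "Lovelace"},
    {"First Name": "Alan", "Last Name": "Turing"},
]
template = FakeDocument(["Dear VARfname VARlname,"])

# Shared document: the first call consumes the placeholders, so later rows change nothing.
shared = copy.deepcopy(template)
shared_results = [list(fill(shared, mapfields, row).paragraphs) for row in rows]

# Fresh copy per row, the equivalent of loading the template inside the function.
fresh_results = [fill(copy.deepcopy(template), mapfields, row).paragraphs for row in rows]

print(shared_results)
print(fresh_results)
```

With the shared document, both results contain the first row's names; with a fresh copy, each row gets its own values.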

Related

Export Pandas Dataframe to well-formed CSV

I have a loop in which, on every iteration, I export a pandas dataframe to a CSV file. The problem is that I get the output shown in the first picture, but I need something like the second one.
I also tried with some encoding type, such as utf-8, utf-16, but nothing changed.
The only difference between my solution and the ones found online is that my dataframe is built from a pickle file, but I don't think this is the problem.
for pickle_file in files:
    key = pickle_file.split('/')[5].split('\\')[1] + '_' + pickle_file.split('/')[5].split('\\')[4]
    with lz4.frame.open(pickle_file, "rb") as f:
        while True:
            try:
                diz[key].append(pickle.load(f))
            except EOFError:
                break
for key in diz.keys():
    a = diz[key]
    for j in range(len(a)):
        t = a[j]
        for index, row in t.iterrows():
            if row['MODE'] != 'biflow':
                w = row['W']
                feature = row['FEATURE']
                mean = row['G-MEAN']
                rmse = row['RMSE']
                df.loc[-1] = [w] + [feature] + [rmse] + [mean] + [key]
                df.index = df.index + 1
    df = df.sort_values(by=['W'])
    df.to_csv(path + key + '.csv', index=False)
    df = df[0:0]
The data is correctly formed. What you need to do is split each row into columns. In MS Excel it's Data > Text to Columns and then follow the function wizard.
If you are using a different application for opening the data, just google how to split text row data into columns for that application.
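To check that such a file really is well formed, it can be parsed back with Python's csv module. This is a small sketch; the sample text and its column names are hypothetical, since the actual output only appears in the pictures.

```python
import csv
import io

# Hypothetical sample of what one exported file might contain.
raw_text = (
    "W,FEATURE,RMSE,G-MEAN,KEY\n"
    "10,length,0.12,0.91,run_a\n"
    "20,length,0.10,0.93,run_a\n"
)

# csv.reader splits each comma-separated line into a list of column values.
rows = list(csv.reader(io.StringIO(raw_text)))
header, data = rows[0], rows[1:]

print(header)
for record in data:
    print(record)
```

If every record comes back with the same number of fields as the header, the file is fine and only the viewing application needs to be told how to split the columns.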

Excel parser stuck on one row

So I was making a quick script to loop through a bunch of sheets in an excel file (22 to be exact) and what I wanted to do was the following:
Open the excel sheet and open the sheet named "All" which contained a list of names and then loop through each name and do the following
To loop through all the other 22 sheets in the same workbook and look through each one for the name, which I knew was in the 'B' column.
If the name were to be found, I wanted to take all the columns in that row containing the data for that name and these columns were from A-H
Then copy and paste them next to the original name (same row) in the 'All sheet' while leaving a bit of a space between the original name and the others.
I wanted to do this for all 22 sheets and for the 200+ names listed in the 'All' sheet, my code is as follows:
import openpyxl, pprint
columns = ['A','B','C','D','E','F','G','H']
k = 10
x = 0
def colnum_string(n):
    string = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        string = chr(65 + remainder) + string
    return string
print("Opening Workbook...")
wb = openpyxl.load_workbook('FileName.xlsx')
sheet_complete = wb.get_sheet_by_name("All")
row_count_all = sheet_complete.max_row
for row in range(4, row_count_all+1):
    k = 10
    cell = 'B' + str(row)
    print(cell)
    name = sheet_complete[cell].value
    for i in range(2, 23):
        sheet = wb.get_sheet_by_name(str(1995 + i))
        row_count = sheet.max_row
        for row2 in range(2, row_count+1):
            cell2 = 'B' + str(row2)
            name2 = sheet[cell].value
            if name == name2:
                x = x + 1
                for z in range(0,len(columns)):
                    k = k + 1
                    cell_data = sheet[columns[z] + str(row2)].value
                    cell_target = colnum_string(k) + str(row)
                    sheet_complete[cell_target] = cell_data
                wb.save('Scimago Country Ranking.xlsx')
                print("Completed " + str(x) + " Task(s)")
                break
The problem is that it keeps looping with the first name only, so it goes through all the names but when it comes to copying and pasting the data, it just redoes the first name so in the end, I end up with all the names in the 'All' sheet and next to each one is the data for the first name repeated over and over. I can't see what's wrong with my code but forgive me if it's a silly mistake as I'm kind of a beginner in these excel parsing scripts. print statements were for testing reasons.
P.S I know I'm using a deprecated function and I will change that, I was just too lazy to do it since it seems to still work fine and if that's the problem then please let me know.
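One possible cause, judging only from the code as posted: the inner loop builds cell2 from row2 but then reads sheet[cell], which is the address from the 'All' sheet. The comparison therefore never depends on row2, and any match is reported at the first row scanned. The sketch below imitates this with plain dicts in place of worksheets; the sheet contents and both helper functions are hypothetical and do not use openpyxl.

```python
# Hypothetical stand-ins for worksheets: dicts mapping a cell address to its value.
sheet_year = {"B2": "Brazil", "B3": "Chile", "B4": "Spain", "B5": "Italy"}


def find_row_as_posted(sheet, name, cell, max_row):
    """Mimics the posted loop: reads the outer `cell`, so row2 is ignored."""
    for row2 in range(2, max_row + 1):
        name2 = sheet.get(cell)
        if name == name2:
            return row2
    return None


def find_row(sheet, name, max_row):
    """Reads the address built from row2, so the match is the name's own row."""
    for row2 in range(2, max_row + 1):
        cell2 = "B" + str(row2)
        if sheet.get(cell2) == name:
            return row2
    return None


# "Spain" sits in B4 of the 'All' sheet and "Italy" in B5 (hypothetical layout).
print(find_row_as_posted(sheet_year, "Spain", "B4", 5))
print(find_row_as_posted(sheet_year, "Italy", "B5", 5))
print(find_row(sheet_year, "Spain", 5))
print(find_row(sheet_year, "Italy", 5))
```

In this sketch the posted version returns the first scanned row for both names, which matches the symptom of the first entry's data being copied everywhere. If that is what happens in the real workbook, reading sheet[cell2] instead of sheet[cell] would be the change to try.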

Find list items in excel sheet with Python

I have the following code, which finds non-blank values in column J of an Excel worksheet. It does some things with each value, including getting that person's email address from column K. Then it emails the member using smtp.
What I'd like instead is to get the person's email from a Python list, which can be declared in the beginning of the code. I just can't figure out how to find the matching names in column J in the worksheet per the list, and then get the resulting email address from the list.
Please excuse any horrible syntax...this is my first stab at a major python project.
memlist = {'John Frank':'email#email.com',
           'Liz Poe':'email2#email.com'}
try:
    for i in os.listdir(os.getcwd()):
        if i.endswith(".xlsx") or i.endswith(".xls"):
            workbook = load_workbook(i, data_only=True)
            ws = workbook.get_sheet_by_name(wsinput)
            cell_range = ws['j3':'j7']
            for row in cell_range: # This is iterating through rows 1-7
                #for matching names in memlist
                for cell in row: # This iterates through the columns(cells) in that row
                    value = cell.value
                    if cell.value:
                        if cell.offset(row=0, column =-9).value.date() == (datetime.now().date() + timedelta(days=7)):
                            #print(cell.value)
                            email = cell.offset(row=0, column=1).value
                            name = cell.value.split(',',1)[0]
This is my attempt at an answer.
memlist is not a list, rather it is a dict because it contains key : value pairs.
If you want to check that a certain key exists in a dict, in Python 2 you can use the dict.has_key(key) method (it was removed in Python 3).
In memlist , the name is the key and the corresponding email is the value.
In your code, you could do this:
if memlist.has_key(cell.value): # For Python 2
    if ... # From your code
        email = memlist[cell.value]
In case you're using Python 3, you can search for the key like this:
if cell.value in memlist: # For Python 3
See if this works for you as I couldn't fully comprehend your question.
Shubham,
I used a part of your response in finding my own answer. Instead of the has_key method, I just used another for/in statement with a subsequent if statement.
My fear, however, is that with these multiple for's and if's, the code takes a long time to run and maybe not the most efficient/optimal. But that's worthy of another day.
try:
    for i in os.listdir(os.getcwd()):
        if i.endswith(".xlsx") or i.endswith(".xls"):
            workbook = load_workbook(i, data_only=True)
            ws = workbook.get_sheet_by_name(wsinput)
            cell_range = ws['j3':'j7']
            for row in cell_range: # This is iterating through rows 1-7
                for cell in row: # This iterates through the columns(cells) in that row
                    value = cell.value
                    if cell.value:
                        if cell.offset(row=0, column =-9).value.date() == (datetime.now().date() + timedelta(days=7)):
                            for name, email in memlist.items():
                                if cell.value == name:
                                    #send the email
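On the efficiency concern: since memlist is a dict, the innermost for/if pair can probably be replaced by a single lookup. This is a minimal sketch; `email_for` is a hypothetical helper and the sample cell values are made up.

```python
memlist = {'John Frank': 'email#email.com',
           'Liz Poe': 'email2#email.com'}


def email_for(cell_value, members):
    """Return the email stored for the name in a cell, or None if it is not listed."""
    if not cell_value:
        return None
    # dict.get does one hash lookup instead of scanning every (name, email) pair.
    return members.get(cell_value)


# Hypothetical cell values as they might come out of column J.
for value in ['Liz Poe', 'Unknown Person', None]:
    print(value, '->', email_for(value, memlist))
```

With only a handful of members the difference is negligible, but the lookup also drops one level of nesting.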

Write list to specific column in csv

I'm trying to write the data from my list to just column 4
namelist = ['PEAR']
for name in namelist:
    for man_year in yearlist:
        for man_month in monthlist:
            with open('{2}\{0}\{1}.csv'.format(man_year,man_month,name),'w') as filename:
                writer = csv.writer(filename)
                writer.writerow(name)
            time.sleep(0.01)
it outputs to a csv like this
P E A R
4015854 234342 2442343 234242
How can I get it to go on just the 4th column?
PEAR
4015854 234342 2442343 234242
Replace the line writer.writerow(name) with,
writer.writerow(['', '', '', name])
When you pass name to the csv writer, it treats the string as an iterable and writes each character to its own column.
To get rid of this problem, change the following line:
writer.writerow(name)
With:
writer.writerow([''] * (len(other_row)-1) + [name])
Here other_row can be one of the rest rows, but if you are sure about the length you can do something like:
writer.writerow([''] * (length-1) + [name])
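The difference is easy to see by writing both variants to an in-memory buffer. This is a small sketch using io.StringIO instead of a real file; the padding of three empty cells assumes the target is the 4th column, as in the question.

```python
import csv
import io

name = 'PEAR'
buffer = io.StringIO()
writer = csv.writer(buffer)

# A bare string is iterated character by character: one cell per letter.
writer.writerow(name)

# A list yields one cell per element, so three empty cells push the name to column 4.
writer.writerow([''] * 3 + [name])

lines = buffer.getvalue().splitlines()
print(lines)
```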
Instead of writing '' to cells you don't want to touch, you could use df.at instead. For example, you could write df.at[index, ColumnName] = 10 which would change only the value of that specific cell.
You can read more about it here: Set value for particular cell in pandas DataFrame using index

Python csv, if check using header

I have a file daily.csv which is updated on a daily basis. The script fetches values from it, checks them against some values, and then prints some info based on those checks. My problem is that almost once a week the number of columns in daily.csv increases or decreases, and I then have to find the new column number for the values I want. Is there a way to use the header values in my if checks so that the script still runs regardless of how the column numbers change?
with open('daily.csv','rb') as f:
    reader=csv.reader(f)
    #next(reader, None) #Skipping the header
    for row in reader:
        #if row[3]=='M2' and (float(row[38]) > 60):
        try:
            if (row[3]=='M2' or row[3]=='M4' or row[3]=='M3') and (float(row[37]) > 60):
                print row[1] + "/" + row[2] + "/" + row[3] + " : " + row[37]
            if (row[3]=='M2' or row[3]=='M4' or row[3]=='M3') and (float(row[37]) < 70):
                print row[1] + "/" + row[2] + "/" + row[3] + " : " + row[37]
        except:
            pass
This seems like a problem that pandas (http://pandas.pydata.org/) can easily help you solve. Assuming that all of your columns have unique headers you can just extract the relevant columns you care about. Let's say that your column names are 'A', 'B', and 'C' and you don't want to analyze column 'B'.
import pandas as pd
df = pd.read_csv('daily.csv')
columns_to_analyze = ['A', 'C']
df = df[columns_to_analyze]
Without knowing more specifics of how 'daily.csv' is formatted that's as much as I can confidently say...
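If you would rather stay with the standard library, csv.DictReader keys each row by the header names, so the checks no longer depend on column positions. This is a sketch in Python 3 under stated assumptions: the headers "model" and "load" and the sample data are hypothetical, because the real layout of daily.csv is not shown.

```python
import csv
import io

# Hypothetical sample; the real daily.csv headers are not shown in the question.
sample = (
    "id,site,rack,model,load\n"
    "1,A,r1,M2,65.0\n"
    "2,B,r2,M9,80.0\n"
    "3,C,r3,M4,n/a\n"
)


def matching_rows(handle, models=("M2", "M3", "M4"), threshold=60.0):
    """Return rows whose model is in `models` and whose load exceeds `threshold`."""
    matches = []
    for row in csv.DictReader(handle):
        try:
            load = float(row["load"])
        except ValueError:
            continue  # skip rows whose load is not numeric
        if row["model"] in models and load > threshold:
            matches.append(row)
    return matches


found = matching_rows(io.StringIO(sample))
for row in found:
    print(row["site"] + "/" + row["rack"] + "/" + row["model"] + " : " + row["load"])
```

With a real file, the handle would come from open('daily.csv', newline='') rather than io.StringIO.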
