I am coding in Python 2.7 and have an existing CSV file called input.csv. It has 3 headers (Filename, Category, Version), and values already exist under each of them.
I want to know how to reopen the CSV file and write a single value into every row of the "Version" column, so that whatever was written under "Version" is overwritten by the new input.
So suppose under the "Version" column I had 3 inputs in 3 rows:
VERSION
55
66
88
It gets rewritten by my new input 10 so it will look like:
VERSION
10
10
10
I know normally we input csv row-wise but this time around I just want to input column wise under that specific header "Version".
Solution 1:
With pandas, you can use:
import pandas as pd
file = 'input.csv'
df = pd.read_csv(file)
# The label must match the header in the file exactly ('Version' vs 'VERSION')
df['VERSION'] = 10
df.to_csv(file, index=False)
Solution 2: If there are multiple rows (and you only want first 3), then you can use:
df.loc[df.index < 3, ['VERSION']] = 10
instead of:
df['VERSION'] = 10
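If pandas is not available, the same column overwrite can be sketched with the standard csv module. This is a minimal sketch, not the answerer's method: the helper name overwrite_column is made up for illustration, and it is written for Python 3 (under Python 2.7 the files would be opened in 'rb'/'wb' mode instead of using text streams).

```python
import csv
import io

def overwrite_column(src, dst, column, value):
    """Copy CSV rows from src to dst, setting `column` to `value` in every row."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = value  # replace whatever was stored under this header
        writer.writerow(row)

# In-memory demonstration using the sample values from the question
src = io.StringIO("Filename,Category,Version\na.txt,x,55\nb.txt,y,66\nc.txt,z,88\n")
dst = io.StringIO()
overwrite_column(src, dst, "Version", 10)
result = dst.getvalue()
```

For a real file, one option is to write to a temporary file and then replace input.csv with it, since reading and writing the same path at the same time would truncate the input.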
I have multiple inventory tables like so:
line no   -1 qty    -2 qty
1         -         3
2         42.1 FT   -
3         5         -
4         -         10 FT
5         2         1
6         6.7       -
or
line no   qty
1         2
2         4.5 KG
3         5
4
5         13
6         AR
I want to create a logic check for the quantity column using Python. (A table may have more than one qty column, and I need to be able to check all of them. In both examples, the tables are held as dataframes.)
Acceptable criteria:
integer with or without "EA" (meaning each)
"AR" (as required)
integer or float with unit of measure
if multiple QTY columns, then "-" is also accepted (first table)
I want to return a list per page, containing the line no. corresponding to rows where quantity value is missing (line 4, second table) or does not meet acceptance criteria (line 6, table 1). If the line passes the checks, then return True.
I have tried:
qty_col = [col for col in df.columns if 'qty' in col]
df['corr_qty'] = np.where(qty_col.isnull(), False, df['line_no'])
but this creates the quantity columns as a list and yields the following
AttributeError: 'list' object has no attribute 'isnull'
Intro and Suggestions:
Welcome to StackOverflow. A general tip when asking questions on S.O. is to include as much information as possible. In addition, always identify the libraries you want to use and the approach you would accept, since there can be multiple solutions to the same problem. It looks like you've done that.
It is also best to share all, or at least most, of your attempted solutions so others can follow your thought process and work out the best approach to a potential solution.
The Solution:
It wasn't clear whether the solution you are looking for has to read the PDF directly to create the dataframe, or whether converting the PDF to a CSV and processing the CSV is sufficient. I took the latter approach.
import tabula as tb
import pandas as pd
# PDF file path
input_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.pdf"
# CSV file path
output_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.csv"
# Alternative: read the PDF straight into dataframes
# tables = tb.read_pdf(input_file_path, pages='all')
# Convert the PDF to CSV (convert_into writes the file and returns nothing)
tb.convert_into(input_file_path, output_file_path, "csv", pages="all")
# Read the initial data
initial_data = pd.read_csv(output_file_path, delimiter=",")
# Print the initial data
print(initial_data)
# Keep only the 'qty' column
df = pd.DataFrame(initial_data, columns=['qty'])
# Print True where a quantity is present and False where it is missing
print(df.notna())
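The acceptance criteria from the question can be sketched as a pair of regular expressions. This is a minimal sketch under several assumptions: every cell has been read as text (for example with dtype=str), a unit of measure is any run of letters, and the function names is_valid_qty and failing_lines are invented for illustration. The rows are plain dictionaries here; a dataframe could be converted with df.to_dict('records').

```python
import math
import re

# Integer, optionally followed by "EA"
INT_OPTIONAL_EA = re.compile(r"^\d+(\s*EA)?$", re.IGNORECASE)
# Integer or float followed by a unit of measure (assumed: letters only)
NUMBER_WITH_UNIT = re.compile(r"^\d+(\.\d+)?\s*[A-Za-z]+$")

def is_valid_qty(value, allow_dash=False):
    """Return True if a single quantity cell meets the acceptance criteria."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return False  # missing value
    text = str(value).strip()
    if text == "":
        return False
    if text == "-":
        return allow_dash  # only accepted when the table has several qty columns
    if text.upper() == "AR":
        return True
    return bool(INT_OPTIONAL_EA.match(text) or NUMBER_WITH_UNIT.match(text))

def failing_lines(rows, line_key="line no"):
    """Return the line numbers that fail the check, or True if every line passes."""
    if not rows:
        return True
    qty_keys = [key for key in rows[0] if "qty" in key.lower()]
    allow_dash = len(qty_keys) > 1
    bad = [row[line_key] for row in rows
           if not all(is_valid_qty(row[key], allow_dash) for key in qty_keys)]
    return bad if bad else True

table1 = [
    {"line no": 1, "-1 qty": "-", "-2 qty": "3"},
    {"line no": 2, "-1 qty": "42.1 FT", "-2 qty": "-"},
    {"line no": 3, "-1 qty": "5", "-2 qty": "-"},
    {"line no": 4, "-1 qty": "-", "-2 qty": "10 FT"},
    {"line no": 5, "-1 qty": "2", "-2 qty": "1"},
    {"line no": 6, "-1 qty": "6.7", "-2 qty": "-"},
]
table2 = [
    {"line no": 1, "qty": "2"},
    {"line no": 2, "qty": "4.5 KG"},
    {"line no": 3, "qty": "5"},
    {"line no": 4, "qty": None},
    {"line no": 5, "qty": "13"},
    {"line no": 6, "qty": "AR"},
]
result1 = failing_lines(table1)
result2 = failing_lines(table2)
```

Returning either a list or True is one reading of the requirement in the question; returning an empty list for a clean page may be easier to work with downstream.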
How to implement a joining of 2 text files using python and output a third file but only adding values present in one file that have corresponding matching value in second file?
Input File1.txt:
GameScore|KeyNumber|Order
85|2568909|2|
84|2672828|1|
80|2689999|5|
65|2123232|3|
Input File2.txt:
KeyName|RecordNumber
Andy|2568909|
John|2672828|
Andy|2672828|
Megan|1000021|
Required Output File3.txt:
KeyName|KeyNumber|GameScore|Order
Andy|2672828|84|1|
Andy|2568909|85|2|
John|2672828|84|1|
Megan|1000021||
Look for a key name and record number in File 2 and match it with KeyNumber in file 1 and copy the corresponding game score and order values.
The files have anywhere from 1 to 500000 records so need to be able to run for a large set.
Edit: I do not have access to any libraries like pandas and not allowed to install any libraries.
Essentially, I need to run a command that triggers a program which reads the 2 files, compares them, and generates the third file.
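If pandas is available, you can do this with a merge:
import pandas as pd
# index_col=False stops pandas from treating the first column as the index
# when every data row ends with a trailing '|'; dtype=str keeps the values as text
df1 = pd.read_csv('File1.txt', sep='|', index_col=False, dtype=str)
df2 = pd.read_csv('File2.txt', sep='|', header=0, names=['KeyName', 'KeyNumber'], index_col=False, dtype=str)
df3 = df1.merge(df2, on='KeyNumber', how='right')
df3[['KeyName', 'KeyNumber', 'GameScore', 'Order']].to_csv('File3.txt', sep='|', index=False)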
See the documentation for fine-tuning.
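Since the question rules out third-party libraries, the same right join can be sketched with the standard library only. This is a minimal sketch, assuming KeyNumber is unique in File1, that a dictionary holding up to 500000 entries fits in memory, and that unmatched rows should keep empty score and order fields with a trailing pipe. The function name join_files is made up for illustration.

```python
import io

def join_files(scores, names, out):
    """Write one output line per row of `names`, looking up scores by key number."""
    lookup = {}
    next(scores, None)  # skip the File1 header
    for line in scores:
        parts = line.rstrip("\n").split("|")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        score, key, order = parts[0], parts[1], parts[2]
        lookup[key] = (score, order)

    out.write("KeyName|KeyNumber|GameScore|Order\n")
    next(names, None)  # skip the File2 header
    for line in names:
        parts = line.rstrip("\n").split("|")
        if len(parts) < 2:
            continue
        name, key = parts[0], parts[1]
        score, order = lookup.get(key, ("", ""))  # empty fields when no match
        out.write("|".join((name, key, score, order)) + "|\n")

# In-memory demonstration with the sample data from the question
file1 = io.StringIO(
    "GameScore|KeyNumber|Order\n85|2568909|2|\n84|2672828|1|\n80|2689999|5|\n65|2123232|3|\n"
)
file2 = io.StringIO(
    "KeyName|RecordNumber\nAndy|2568909|\nJohn|2672828|\nAndy|2672828|\nMegan|1000021|\n"
)
file3 = io.StringIO()
join_files(file1, file2, file3)
output_lines = file3.getvalue().splitlines()
```

With real files, the three arguments would be objects returned by open(), and the script can be started from the command line. Rows come out in File2 order; sorting them by name, as in the required output, would need an extra step.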
To aggregate and compute per-second values, I am doing the following in Python with pandas. However, the output written to the file does not show the columns in the order they appear here. The column names are somehow sorted, so TotalDMLsSec shows up before UpdateTotal and UpdatesSec.
'DeletesTotal': x['Delete'].sum(),
'DeletesSec': x['Delete'].sum()/VSeconds,
'SelectsTotal': x['Select'].sum(),
'SelectsSec': x['Select'].sum()/VSeconds,
'UpdateTotal': x['Update'].sum(),
'UpdatesSec': x['Update'].sum()/VSeconds,
'InsertsTotal': x['Insert'].sum(),
'InsertsSec': x['Insert'].sum()/VSeconds,
'TotalDMLsSec':(x['Delete'].sum()+x['Update'].sum()+x['Insert'].sum())/VSeconds
})
)
df.to_csv('/home/summary.log', sep='\t', encoding='utf-8-sig')
Apart from the above, I have a couple of other questions:
Despite writing in CSV format, all values/columns appear in a single column in Excel. Is there any way to load the CSV data properly?
Can the rows be sorted by one column (say InsertsSec) by default when writing to the CSV file?
Any help here would be really appreciated.
Assume that your DataFrame is something like this:
      Deletes  Selects  Updates  Inserts
Name
Xxx        20       10       40       50
Yyy        12       32       24       11
Zzz        70       20       30       20
Then both total and total per sec can be computed as:
total = df.sum().rename('Total')
VSeconds = 5 # I assumed some value
tps = (total / VSeconds).rename('Total per sec')
Then you can add both above rows to the DataFrame:
df = df.append(total).append(tps)
The downside is that all numbers are converted to float, but in pandas there is no other way, as each column must hold values of one type. (DataFrame.append was removed in pandas 2.0; pd.concat does the same job there.)
Then you can e.g. write it to a CSV file (with totals included).
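For pandas 2.0 and later, where DataFrame.append no longer exists, the same two summary rows can be added with pd.concat. This is a sketch using the sample values above; the value of VSeconds is an assumption carried over from the answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"Deletes": [20, 12, 70], "Selects": [10, 32, 20],
     "Updates": [40, 24, 30], "Inserts": [50, 11, 20]},
    index=pd.Index(["Xxx", "Yyy", "Zzz"], name="Name"),
)
VSeconds = 5  # assumed value, as in the answer

total = df.sum().rename("Total")
tps = (total / VSeconds).rename("Total per sec")

# to_frame().T turns each named Series into a one-row DataFrame
summary = pd.concat([df, total.to_frame().T, tps.to_frame().T])
```

As with append, the integer columns become float once the per-second row is added.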
This is how I ended up doing it:
df.to_excel(vExcelFile,'All')
vSortedDF=df.sort_values(['Deletes%'],ascending=False)
vSortedDF.loc[vSortedDF['Deletes%']> 5, ['DeletesTotal','DeletesSec','Deletes%']].to_excel(vExcelFile,'Top Delete objects')
vExcelFile.save()
For the CSV, using a comma as the separator instead of \t worked just fine:
df.to_csv('/home/summary.log', sep=',', encoding='utf-8-sig')
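The column-order and row-order questions can both be handled before writing the file. The sketch below is an assumption about how the aggregation was set up, since the start of the original snippet is not shown: the grouping column "Object" and the sample numbers are hypothetical. Building the result as a pd.Series with an explicit index (rather than from a dict) fixes the column order, and sort_values orders the rows.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Object": ["a", "a", "b"],   # hypothetical grouping column
    "Delete": [5, 5, 20],
    "Update": [1, 2, 3],
    "Insert": [10, 20, 3],
})
VSeconds = 10  # assumed duration in seconds

COLUMNS = ["DeletesTotal", "DeletesSec", "UpdateTotal", "UpdatesSec",
           "InsertsTotal", "InsertsSec", "TotalDMLsSec"]

def summarise(x):
    d, u, i = x["Delete"].sum(), x["Update"].sum(), x["Insert"].sum()
    # An explicit index keeps the columns in exactly this order
    return pd.Series(
        [d, d / VSeconds, u, u / VSeconds, i, i / VSeconds, (d + u + i) / VSeconds],
        index=COLUMNS,
    )

summary = df.groupby("Object")[["Delete", "Update", "Insert"]].apply(summarise)
summary = summary.sort_values("InsertsSec", ascending=False)

buf = io.StringIO()          # stands in for the output file
summary.to_csv(buf, sep=",")
header = buf.getvalue().splitlines()[0]
```

If the summary already exists with the wrong order, summary[COLUMNS] reorders it without recomputing anything.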
I've been trying to sort my spreadsheet by 4 columns. Using win32com, I have managed to sort by 3 columns using the below code:
excel = win32com.client.Dispatch("Excel.Application")
wb= excel.Workbooks.Open('.xlsx')
ws= wb.worksheets[0]
ws.Range('D6:D110').Sort(Key1=ws.Range('D1'), Order1=1, Key2=ws.Range('E1'), Order2=2, Key3=ws.Range('G1'), Order3=3, Orientation=1)
However, when I try to add Key4, it says Key4 is an unexpected keyword argument. Is the Range.Sort function limited to only 3 keys? Is there a way to add a 4th?
Is there maybe another way to do this using pandas or openpyxl?
Thanks in advance!
Try reading in the Excel sheet and then sorting by header names. This assumes the sheet is laid out like a CSV, with a single header row.
import pandas as pd
df = pd.read_excel('filename.xlsx')
df = df.sort_values(['key1','key2','key3','key4'])
df.to_excel('filename2.xlsx', index=False)
Simply sort twice, or as many times as needed, in groups of up to three keys.
xlAscending = 1
xlSortColumns = 1
xlYes = 1
ws.Range('D6:D110').Sort(Key1=ws.Range('D1'), Order1=xlAscending,
                         Key2=ws.Range('E1'), Order2=xlAscending,
                         Key3=ws.Range('G1'), Order3=xlAscending,
                         Header=xlYes, Orientation=xlSortColumns)
# FOURTH SORT KEY (ADJUST TO NEEDED COLUMN)
ws.Range('D6:D110').Sort(Key1=ws.Range('H1'), Order1=xlAscending,
                         Header=xlYes, Orientation=xlSortColumns)
By the way your Order numbers should only be 1, 2, or -4135 per the xlSortOrder constants.
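When sorts are chained like this, the order of the passes matters: with a stable sort, the last pass becomes the primary key, so the lowest-priority key has to be applied first. The sketch below shows this in pandas rather than Excel. Whether Range.Sort keeps the earlier order for ties is an assumption worth checking on your own data, and the column letters and values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "D": [1, 1, 1, 2],
    "E": [2, 2, 1, 1],
    "G": [1, 1, 3, 3],
    "H": [9, 4, 7, 5],
})

# One pass with all four keys, D being the highest priority
one_pass = df.sort_values(["D", "E", "G", "H"])

# Two passes: lowest-priority key first, then the three higher-priority keys.
# kind="mergesort" requests a stable sort for the single-column pass.
two_pass = (
    df.sort_values("H", kind="mergesort")
      .sort_values(["D", "E", "G"])
)
```

Both results list the rows in the same order, because ties on D, E and G keep the order established by the earlier sort on H.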
I am new to python and have looked at a number of similar problems on SO, but cannot find anything quite like the problem that I have and am therefore putting it forward:
I have an .xlsx dataset with data spread across eight worksheets and I want to do the following:
sum the values in the 14th column in each worksheet (the format, layout and type of data (scores) is the same in column 14 across all worksheets)
create a new worksheet with all summed values from column 14 in each worksheet
sort the totaled scores from highest to lowest
plot the summed values in a bar chart to compare
I cannot even begin this process because I am struggling at the first point. I am using pandas and am having trouble reading the data from one specific worksheet - I only seem to be able to read the data from the first worksheet only (I print the outcome to see what my system is reading in).
My first attempt produces an 'Empty DataFrame':
import pandas as pd
y7data = pd.read_excel('Documents\\y7_20161128.xlsx', sheetname='7X', header=0,index_col=0,parse_cols="Achievement Points",convert_float=True)
print y7data
I also tried this, but it only exported the first worksheet's data rather than the whole document (I am trying this so that I can understand how to export all the data). I thought that exporting the data to a .csv might give me a clearer view of what went wrong, but I am none the wiser:
import pandas as pd
import numpy as np
y7data = pd.read_excel('Documents\\y7_20161128.xlsx')
y7data.to_csv("results.csv")
I have tried a number of different things to try and specify which column within each worksheet, but cannot get this to work; it only seems to produce the results for the first worksheet.
How can I, firstly, read the data from column 14 in every worksheet, and then carry out the rest of the steps?
Any guidance would be much appreciated.
UPDATE (for those using Enthought Canopy and struggling with openpyxl):
I am using Enthought Canopy IDE and was constantly receiving an error message around openpyxl not being installed no matter what I tried. For those of you having the same problem, save yourself lots of time and read this post. In short, register for an Enthought Canopy account (it's free), then run this code via the Canopy Command Prompt:
enpkg openpyxl 1.8.5
I think you can use this sample file:
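First read column 14 of each sheet into a list of DataFrames called y7data. (In current pandas versions the keywords are sheet_name and usecols rather than sheetname and parse_cols.)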
y7data = [pd.read_excel('y7_20161128.xlsx', sheetname=i, parse_cols=[13]) for i in range(3)]
print (y7data)
[ a
0 1
1 5
2 9, a
0 4
1 2
2 8, a
0 5
1 8
2 5]
Then concat all the columns together. I add keys, which are used for the x axis of the graph, sum all the columns, remove the second level of the MultiIndex (a, a, a in the sample data) with reset_index, and finally apply sort_values:
print (pd.concat(y7data, axis=1, keys=['a','b','c']))
   a  b  c
   a  a  a
0  1  4  5
1  5  2  8
2  9  8  5
summed = (pd.concat(y7data, axis=1, keys=['a','b','c'])
            .sum()
            .reset_index(drop=True, level=1)
            .sort_values(ascending=False))
print (summed)
c 18
a 15
b 14
dtype: int64
Create a new DataFrame df, set the column names, and write it with to_excel:
df = summed.reset_index()
df.columns = ['a','summed']
print (df)
a summed
0 c 18
1 a 15
2 b 14
If you need to add a new sheet to the existing workbook, use this solution:
from openpyxl import load_workbook
book = load_workbook('y7_20161128.xlsx')
writer = pd.ExcelWriter('y7_20161128.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, "Main", index=False)
writer.save()
Finally, plot with Series.plot.bar:
import matplotlib.pyplot as plt
summed.plot.bar()
plt.show()
From what I understand, your immediate problem is managing to load the 14th column from each of your worksheets.
You could use ExcelFile.parse instead of read_excel and loop over your sheets. (In current pandas versions the keyword is usecols rather than parse_cols.)
xls_file = pd.ExcelFile('Documents\\y7_20161128.xlsx')
worksheets = ['Sheet1', 'Sheet2', 'Sheet3']
series = [xls_file.parse(sheet, parse_cols=[13]) for sheet in worksheets]
df = pd.concat(series, axis=1, keys=worksheets)
And from that, sum() your columns and keep going.
Using ExcelFile and then ExcelFile.parse() has the advantage of loading your Excel file only once and then iterating over each worksheet. Using read_excel reloads the Excel file in each iteration, which is wasteful.
Documentation for pandas.ExcelFile.parse.
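With a current pandas version, the per-sheet sums can be sketched as follows. This is a minimal sketch, not either answerer's code: the function names are invented, the sheet names other than 7X are hypothetical, and the loader is defined but not called here because it needs the real workbook. Passing sheet_name=None makes read_excel return a dictionary of DataFrames keyed by sheet name, and usecols=[13] selects the 14th column by zero-based position.

```python
import pandas as pd

def load_column_from_all_sheets(path, position=13):
    """Read one column (zero-based position) from every worksheet in the file."""
    return pd.read_excel(path, sheet_name=None, usecols=[position])

def sum_per_sheet(frames):
    """Sum the single column of each sheet and sort the totals, highest first."""
    totals = {name: frame.iloc[:, 0].sum() for name, frame in frames.items()}
    return pd.Series(totals).sort_values(ascending=False)

# Stand-in for load_column_from_all_sheets('y7_20161128.xlsx'),
# using the sample numbers from the first answer
frames = {
    "7X": pd.DataFrame({"a": [1, 5, 9]}),
    "7Y": pd.DataFrame({"a": [4, 2, 8]}),
    "7Z": pd.DataFrame({"a": [5, 8, 5]}),
}
summed = sum_per_sheet(frames)
```

The resulting Series can be written to a new sheet or plotted with summed.plot.bar(), as in the first answer.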