Begin Code
import pandas as pd
df = pd.read_csv('C:/Users/lhicks/Documents/Corporate/test.csv')  # read_csv only needs the path; no 'r' mode argument
saved_column = df.FirstName
saved_column2 = df.LastName
saved_column3 = df.Email
print(saved_column)
print(saved_column2)
print(saved_column3)
Itemlist = []
Itemlist.append(saved_column)
print(Itemlist)
End of Code
The objective is to select specific columns from the specified sheet, grab all the rows from those columns, and then print that data out.
The current issue is that the data is grabbed correctly, but after 29-30 rows it prints/stores a "...", then jumps to rows in the 880s and finishes out from there.
The additional issue is that this truncated form also appears to be what gets stored as the value, making it worthless because it does not provide the full dataset.
The eventual process is to add the selected columns to a new xls sheet to clean up the old data, and then add the rows to a templated document to generate an advertisement letter.
The first question is: how do I have all the fields populate? The second is: what is the best approach for this? Please provide additional links as well if possible; this is a practical learning experience for me.
Pandas shortens (truncates) your data when printing it.
NOTE: all the data is still there (check with print(df.shape), which prints the shape of your DataFrame); the "..." is just a convenient way not to flood your screen with tons of data rows/columns.
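For example, a quick sanity check (just a sketch, assuming the df from the question above):
print(df.shape)                      # (rows, columns) of the full DataFrame
full_emails = df['Email'].tolist()   # a plain Python list containing every row's value
print(len(full_emails))              # matches df.shape[0], so nothing was dropped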
Try this:
fn = 'C:/Users/lhicks/Documents/Corporate/test.csv'
cols = ['FirstName','LastName','Email']
df = pd.read_csv(fn, usecols=cols)
df.to_excel('/path/to/excel.xlsx', index=False)
This will parse only the ['FirstName', 'LastName', 'Email'] columns from the CSV file and export them to an Excel file.
UPDATE:
if you want to control how many rows Pandas should print:
with pd.option_context("display.max_rows", 200):
    print(df)
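Once the columns are in a DataFrame, the letter-generation step mentioned in the question can be a plain loop over the rows. This is only a hedged sketch: the template text and the output file naming are assumptions, not part of the original post.
template = "Dear {first} {last},\n\nThis offer was prepared for {email}.\n"
for i, row in enumerate(df.itertuples(index=False)):
    letter = template.format(first=row.FirstName, last=row.LastName, email=row.Email)
    with open(f"letter_{i}.txt", "w", encoding="utf-8") as out:   # assumed output naming
        out.write(letter)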
Related
I have a CSV file which I want to normalize for SQL input. I want to drop every line where the column count is not equal to a certain number, so I can ignore the bad lines where a column shift can happen. In the past I used AWK to normalize this CSV dataset, but I want to implement this program in Python for easier parallelization than a GNU Parallel + AWK solution.
I tried the following code to drop the lines:
df.drop(df[df.count(axis='columns') != len(usecols)].index, inplace=True)
df = df[df.count(axis=1) == len(usecols)]
df = df[len(df.index) == len(usecols)]
None of these work; I need some help, thank you!
EDIT:
I'm working on a single CSV file on a single worker.
EDIT 2:
Here is the awk script for reference:
{
    line = $0;
    # ...
    if (line ~ /^$/) next;   # if line is blank, then remove it
    if (NF != 13) next;      # if column count is not equal to 13, then remove it
}
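For reference, the same filter expressed as a plain-Python sketch (the '|' separator, the 13-field count and the stdout output are assumptions carried over from the rest of this question, not a definitive port):
import sys

with open('input.csv', encoding='utf-8') as f:        # assumed input file name
    for line in f:
        line = line.rstrip('\n')
        if not line.strip():                          # if line is blank, then remove it
            continue
        if len(line.split('|')) != 13:                # if column count is not equal to 13, then remove it
            continue
        sys.stdout.write(line + '\n')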
The question is not easy to understand. From the first statement it appears as if you are working with a single file, is that correct?
If so, and if there are unnamed columns, then pandas (or dask via pandas) will attempt to 'fix' the structure by adding the missing column labels with something like 'Unnamed: 0'. Once that happens, it's easy to drop the misaligned rows by using something like:
mask = df['Unnamed: 0'].isna()
df = df[mask]
Edit: if there are rows that contain more entries than the number of defined columns, pandas will raise an error saying it was not able to parse the CSV.
If, however, you are working with multiple csv files, then one option is to use dask.delayed to enforce compatible columns; see this answer for further guidance.
It's easier to post a separate answer: it seems this problem can be solved by passing the on_bad_lines kwarg to pandas.read_csv (note: if you are using a pandas version lower than 1.3.0, you will need to use error_bad_lines instead). Roughly, the code would look like this:
from pandas import read_csv
df = read_csv('some_file.csv', on_bad_lines='warn') # can use skip
Since dask.dataframe can pass kwargs to pandas, the above can also be written for dask.dataframe:
from dask.dataframe import read_csv
df = read_csv('some_file.csv', on_bad_lines='warn') # can use skip
With this, the imported CSV will not include any lines that have more columns than expected based on the header (a line with fewer elements than the number of columns is still included, with the missing values set to None).
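If you need finer control than 'warn'/'skip', newer pandas (1.4+) also accepts a callable for on_bad_lines when engine='python'. A hedged sketch, where the '|' separator and the expectation that short lines can simply be dropped are assumptions:
import pandas as pd

df = pd.read_csv(
    'some_file.csv',
    sep='|',                              # assumed separator
    engine='python',                      # required for a callable on_bad_lines
    on_bad_lines=lambda fields: None,     # drop any line with more fields than the header
)
df = df.dropna()                          # short lines were padded with NaN; drop them too (assumes no legitimate empty fields)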
I ended up creating a function which pre-processes the zipped CSV file for Pandas/Dask. These are not CPU/memory heavy tasks and parallelization is not important in this step, so until there is a better way to do this, here we are. I'm adding a proper header to my pre-processed CSV file, too.
from io import TextIOWrapper
from zipfile import ZipFile

# csv_filename, destination, filename, usecols and column_count come from the surrounding function
with open(csv_filename, 'wt', encoding='utf-8', newline='\n') as file:
    join = '|'.join(usecols)
    file.write(f"{join}\n")  # Adding header
    with ZipFile(destination) as z:
        with TextIOWrapper(z.open(f"{filename}.csv"), encoding='utf-8') as f:
            for line in f:
                line = line.strip()  # Remove surrounding whitespace from line
                if line:  # Exclude empty lines
                    array = line.split("|")
                    if len(array) == column_count:
                        del array[1:3]  # Remove the 2nd and 3rd fields
                        array = [s.strip() for s in array]  # Strip whitespace from each field
                        join = '|'.join(array)
                        file.write(f"{join}\n")
PS: This is not an answer to my original question, which is why I won't accept it.
How can I read an Excel file from column AF onwards? I don't know the last column's letter and the file is too large to keep checking manually.
df = pd.read_excel(r"Documents\file.xlsx", usecols="AF:")
You can't write it like that directly in the read_excel function, so we can only look for other possible options.
For the moment we could write 'AF:XFD', because 'XFD' is the last possible column in Excel, but it warns that this will be deprecated soon and will start raising a ParseError, so it's not recommended.
You can use other libraries to find the last column, but that isn't very fast: the Excel file is read once to check the last column, and only after that do we create the DataFrame. For example:
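A hedged sketch with openpyxl in read-only mode to find the last used column first (the file path is taken from the question; read-only mode keeps the extra pass reasonably cheap):
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(r"Documents\file.xlsx", read_only=True)
last_col = wb.worksheets[0].max_column                     # 1-based index of the last used column
wb.close()
df = pd.read_excel(r"Documents\file.xlsx", usecols=list(range(31, last_col)))  # 31 = zero-based AF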
If I had such a problem I would do everything in Pandas, by adding .iloc at the end. We know that 'AF' is the 32nd column in Excel, so its zero-based position is 31:
df = pd.read_excel(r"Documents\file.xlsx").iloc[:, 31:]
It will return all columns from 'AF' till the end without having a noticeable impact on performance.
You can use the xlwings library to determine the last column that contains data and then plug that into your read_excel call.
import xlwings as xw

app = xw.App(visible=False)
book = app.books.open(r"Documents\file.xlsx")
sheet = book.sheets[0]
colNum = sheet.range('AF1').current_region.last_cell.column  # 1-based index of the last data column
book.close()
app.quit()

df = pd.read_excel(r"Documents\file.xlsx", usecols=list(range(31, colNum)))
where 31 is the zero-based index of column AF (read_excel counts columns from 0, while colNum from xlwings is 1-based).
I need to add a row above the header columns in a dataframe, which will be converted to an Excel file, with the limitation that there cannot be any reading/writing of files locally. Because of this, I am unable to use with open('filename.xls', 'w') as f:. The script is to be run in a place where files cannot be read from or written to disk.
So, for example, I want something like this:
text here
*animal* *no_of_legs* *name*
cat 4 meow
bird 2 chirp
rabbit 2 bun
I have an array allAnimals consisting of all the animals' data.
I tried allAnimals.insert(0, ['text here']) and then df = pd.DataFrame(allAnimals, columns=['animals', 'no_of_legs', 'name']) to convert it to a dataframe. I then use df.to_excel(xxx, index=False), but I get something like this instead:
*animal* *no_of_legs* *name*
text here
cat 4 meow
bird 2 chirp
rabbit 2 bun
Alternatively, another method I tried involved creating a new dataframe storing only 'text here' and then using concat, but it doesn't add the data as a row above the header; it adds a new column instead. The same goes for append. So this is what I get:
*animal* *no_of_legs* *name*
text here
cat 4 meow
bird 2 chirp
rabbit 2 bun
I have read some other questions similar to this but they are not very applicable to my case, as such, I am unable to solve this. Any tips would be appreciated!
Is it possible to add a row above the headers in pandas
Add rows *on top of* column names Pandas Dataframe as header info?
How to add an empty row on top of the header row in a dataframe?
Add rows back to the top of a dataframe
You can give your dataframe's columns an additional index level by making them a MultiIndex. Assign df.columns to be a pd.MultiIndex:
Attempting to reproduce the example:
import pandas as pd
allAnimals = [['cat', 4, 'meow'], ['bird', 2, 'chirp'], ['rabbit', 2, 'bun']]
df = pd.DataFrame(allAnimals, columns=['animals', 'no_of_legs', 'name'])
df.columns = pd.MultiIndex.from_tuples(
    zip(['text here', '', ''], df.columns))
df
Output: the same three rows, with 'text here' shown as an extra level above the original column names.
Remove the index column accordingly if needed.
This may be of help to you: working with pandas additional column index
The above answer was a big step forward in answering the question; however, it involved a MultiIndex, which prevented me from using index=False in df.to_excel(xxx, index=False). This left the Excel file badly formatted, since the index column is now written out.
This meant I got an Excel file with an extra index column on the left and a blank row underneath the header.
Honestly, the only complication here is the limitation of not being able to read/write locally.
I went on to research on how to fix this, and came across some articles:
How to hide the rows index
Get rid of index while outputting multi header pandas dataframe to excel
I couldn't implement the first site's answer because I couldn't supply a local path. I decided to use the second site's first workaround, which involved using openpyxl and a BytesIO buffer.
In case anyone needs this in the future, here's the implementation:
import io
from openpyxl import load_workbook

in_memory_fp = io.BytesIO()
df.to_excel(in_memory_fp)              # write the MultiIndex frame into the in-memory buffer
in_memory_fp.seek(0, 0)
workbook = load_workbook(in_memory_fp)
worksheet = workbook['Sheet1']
worksheet.delete_cols(1)               # drop the index column that to_excel wrote
worksheet.delete_rows(3)               # drop the blank row left under the two-level header
for i in worksheet.values:
    print(i)
in_memory_fp.seek(0)
in_memory_fp.truncate()                # clear the buffer so it holds only the cleaned workbook
workbook.save(in_memory_fp)
and now the exported Excel file comes out nicely formatted.
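If the workbook then has to be handed somewhere else without touching disk, the finished bytes can be pulled straight from the buffer (a small follow-up sketch using the in_memory_fp object from above):
excel_bytes = in_memory_fp.getvalue()   # the cleaned .xlsx as bytes, ready to upload or return from an API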
I have two Excel worksheets I am reading in Python. The first worksheet has a list of company names. The second is a sheet containing many of the same company names, with data to the right that corresponds to each row.
Worksheet 1: https://i.stack.imgur.com/f6mXI.png
Worksheet 2: https://i.stack.imgur.com/4vKGR.png
I want to make some kind of condition: if the name in column A of WS 2 matches a name in WS 1, then print the data (columns A:F of WS 2) only for the rows corresponding to that name.
I am pretty new to coding, so I've been playing with this a lot without much luck. Right now I don't have much code because I restarted again. I've been trying to use just pandas to read the files, and sometimes openpyxl.
import pandas as pd
import xlsxwriter as xlw
import openpyxl as xl
TickList = pd.read_excel("C:\\Users\\Ashley\\Worksheet1.xlsx",sheet_name='Tickers', header=None)
stocks = TickList.values.ravel()
Data = pd.read_excel("C:\\Users\\Ashley\\Worksheet2.xlsx", sheet_name='Pipeline', header=None, usecols="A:F")
data = Data.values.ravel()
for i in stocks:
    for t in data:
        if i == t:
            print(data)
I would imagine that the first thing you are doing wrong is not stipulating the key value on which "i" in stocks is meant to match the values in "t". Remember: "t" is the values, all of them. You have to specify that you wish to match the value of "i" to (probably) the first column of "t". What you appear to be doing here is akin to a VLOOKUP without the target range properly specified.
Whilst I do not know the exact method in which the ravel() function stores the data, I have to believe something like this would be more likely to work:
for i in stocks:
    for t in data:
        if i == t[0]:
            print(t)
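A hedged alternative sketch that keeps both frames two-dimensional and filters with isin, instead of flattening everything with ravel(); the file names and the assumption that the company names sit in the first column are taken from the question:
import pandas as pd

tick_list = pd.read_excel("C:\\Users\\Ashley\\Worksheet1.xlsx", sheet_name='Tickers', header=None)
pipeline = pd.read_excel("C:\\Users\\Ashley\\Worksheet2.xlsx", sheet_name='Pipeline', header=None, usecols="A:F")

stocks = set(tick_list[0])                       # company names from WS 1, column A
matches = pipeline[pipeline[0].isin(stocks)]     # rows of WS 2 whose column A matches
print(matches)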
I have been looking for an approach that will allow me to load only those columns from a CSV file which satisfy a certain condition while I create the DataFrame - something which can skip the unwanted columns, because I have a large number of columns and only some are actually useful for testing purposes. I also want to load only those columns whose mean is > 0.0. The idea is like skipping a certain number of rows or reading the first nrows, but I am looking for condition-based filtering on column names and values.
Is this actually possible with Pandas? Can it do things on the fly, accumulating results without loading everything into memory?
There's no direct/easy way of doing that (that I know of)!
The first idea that comes to mind is to read the first line of the CSV (i.e. the headers) and then create a list of your desired columns using a list comprehension:
columnsOfInterest = [ c for c in df.columns.tolist() if 'node' in c]
and get their positions in the CSV. With that, you'll have the columns/positions, so you can read only those from your CSV.
However, for the second part of your condition, which needs the mean, you'll unfortunately have to read all the data for those columns, run the mean calculations and then keep the ones of interest (where the mean is > 0). But that's the limit of my knowledge; maybe someone else has a way of doing this and can help you out. Good luck!
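One way to combine both conditions without ever holding the whole file in memory is a multi-pass read: grab the header first, accumulate column sums/counts chunk by chunk, then load only the columns whose mean is positive. A hedged sketch, where the file name, the 'node' name filter and the assumption that the candidate columns are numeric all come from this thread rather than anything definitive:
import pandas as pd

fn = 'data.csv'                                                    # assumed file name

# Pass 1: read only the header row to get candidate column names
candidates = [c for c in pd.read_csv(fn, nrows=0).columns if 'node' in c]

# Pass 2: accumulate sums and counts chunk by chunk (assumes numeric columns)
sums = pd.Series(0.0, index=candidates)
counts = pd.Series(0, index=candidates)
for chunk in pd.read_csv(fn, usecols=candidates, chunksize=100_000):
    sums += chunk.sum()
    counts += chunk.count()
keep = sums[(sums / counts) > 0].index.tolist()                    # columns with mean > 0

# Pass 3: load only those columns
df = pd.read_csv(fn, usecols=keep)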
I think usecols is what you are looking for.
df = pandas.read_csv('<file_path>', usecols=['col1', 'col2'])
You could preprocess the column headers using the csv library first.
import csv

with open('data.csv', 'r', newline='') as f:   # text mode (not 'rb') for csv in Python 3
    reader = csv.reader(f)
    column_names = next(reader, None)

filtered_columns = [name for name in column_names if 'some_string' in name]
Then proceed using usecols from pandas as abhi mentioned.
df = pandas.read_csv('data.csv', usecols=filtered_columns)