Use Python to extract information and data from an Excel file

I have run into a problem extracting data from a database1.csv file. The file contains about a million rows, and I need to extract certain columns from it. My code (attached as a screenshot) raises an error when I run it: Error: unknown dialect.
For your information:
1) I need to extract the entire column containing the word "GWM" from database1.csv.
2) After extracting the data, I need to put it all into a new file, result.csv.
3) "GWM" is the word I chose to identify which column to extract.
Any suggestions to improve and fix my code? Thanks.

Make sure you have Python 3 (most recent version) installed and have a command line window open in the folder your file is in.
Install pandas via pip or pip3, whichever works. (pip install pandas)
The code below, if saved and run in the same directory as your .xlsx file, will extract all your columns to .dat files, the filenames being the first row in said columns. From there, just choose the file you want.
import pandas as pd
xlsxname = input('File: ')
# Read once just to discover the column headers.
# (low_memory is a read_csv option; read_excel does not accept it.)
datacols = pd.read_excel(xlsxname)
cols = list(datacols)
lencols = len(cols)
countup = 0
while countup != lencols:
    colstemp = cols[countup]
    # Re-read the file, keeping only the current column.
    data = pd.read_excel(xlsxname, usecols=[colstemp])
    colsname = f'{colstemp}.dat'
    # to_csv rather than to_excel: pandas cannot pick an Excel engine
    # for a .dat extension, and plain text is the point anyway.
    data.to_csv(colsname, index=False, header=False)
    countup = countup + 1
It may be ugly, it may be an idiotic and poorly-coded solution (why not just select a specific column?), but hey, it works.
(in Excel)
...You could also left-click the letter at the top of the column you want, press Ctrl+C, and paste it into a text editor, but hey...
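Since the original question really only wants the one column headed "GWM" written out to result.csv, selecting it directly is simpler. A minimal sketch, assuming pandas is installed and that "GWM" is indeed a column header in database1.csv:
import pandas as pd
# Assumption: 'GWM' is the header of the column to extract,
# as described in the question above.
df = pd.read_csv('database1.csv', usecols=['GWM'])
df.to_csv('result.csv', index=False)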

Related

Python reads only one column from my CSV file

First post here.
I am very new to programming, sorry if this is confusing.
I built a database by collecting different data online. All of these data are in one xlsx file (one column per variable), which I afterwards converted to csv because my teacher only showed us how to use csv files in Python.
I installed pandas and had it read my csv file, but it doesn't seem to understand that I have multiple columns; it reads only one column. So I can't get at each variable (and so I can't transform the data).
I tried df.info() and df.info(verbose=True, show_counts=True), but they show the same thing.
len(df.columns) = 1, which proves it doesn't see that each variable has its own column.
len(df) = 1923, which is right.
I was expecting this: https://imgur.com/a/UROKtxN (different project, not the same database)
database used: https://imgur.com/a/Wl1tsYb
And I get this instead: https://imgur.com/a/iV38YNe
database used: https://imgur.com/a/VefFrL4
I don't know; they look pretty similar, so why doesn't it work? :(
Thanks.
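A common cause of this exact symptom (an assumption here, since the file itself isn't shown): Excel in many locales exports CSV files with semicolons rather than commas as the separator, so pandas reads every row into a single wide column. Letting pandas sniff the delimiter is a quick test (the filename is a stand-in):
import pandas as pd
# Assumption: the file may use ';' (or another non-comma separator),
# as many Excel locales do. sep=None with the python engine makes
# pandas sniff the delimiter instead of assuming ','.
df = pd.read_csv('database.csv', sep=None, engine='python')
print(len(df.columns))  # should now report the real column count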

Picking out a specific column in a table

My goal is to import a table of astrophysical data that I have saved to my computer (obtained from matching 2 other tables in TOPCAT, if you know it), and extract certain relevant columns. I hope to then do further manipulations on these columns. I am a complete beginner in python, so I apologise for basic errors. I've done my best to try and solve my problem on my own but I'm a bit lost.
This script I have written so far:
import pandas as pd
input_file = "location\\filename"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The file that I'm trying to import is listed as having file type "File" in my drive. I've looked at this file in Notepad and it has a lot of descriptive bumf in the first few rows, so to try and get rid of this I've used "skiprows", as you can see. The data in the file is separated column-wise by vertical lines, at least that's how it appears in Notepad.
The problem is that when I try to extract the first column using "usecols", it instead returns what appears to be the first row in the command window, as well as a load of vertical bars between each value. I assume it is somehow not interpreting the table correctly, not understanding what's a column and what's a row.
What I've tried: Modifying the file and saving it in a different filetype. This gives the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'location\\filename'
Despite the fact that the new file is saved in exactly the same location.
I've tried using "pd.read_table" instead of csv, but this doesn't seem to change anything (nor does it give me an error).
When I've tried to extract multiple columns (i.e. "usecols=[1,2]") I get the following error:
ValueError: Usecols do not match columns, columns expected but not found: [1, 2]
My hope is that someone with experience can give some insight into what's likely going on to cause these problems.
Maybe you can try dataset.iloc[:, 0]. With iloc you can extract the column or row you want by integer position. [:, 0] selects all rows of the first column.
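For instance, on a toy frame (hypothetical values, not the asker's actual table):
import pandas as pd
# A toy table standing in for the astrophysical data (made-up values).
dataset = pd.DataFrame({'ra': [10.1, 10.2, 10.3], 'dec': [-5.0, -5.1, -5.2]})
first_col = dataset.iloc[:, 0]  # all rows, first column (by position)
print(first_col)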
The file is incorrectly named.
I expect that you are reading a csv, xlsx, or txt file, so the (Windows) path would look similar to this:
import pandas as pd
input_file = "C:\\python\\tests\\test_csv.csv"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The error message tells you this:
No such file or directory: 'location\\filename'

Deleting some rows from a .csv file causes NaN columns to be added to it

python version: 3.7.11
pandas version: 1.1.3
IDE: Jupyter Notebook
Software for opening and resaving the .csv file: Microsoft Excel
I have a .csv file. You can download it from here: https://icedrive.net/0/35CvwH7gqr
In the .csv file, I looked for rows that have blank cells, and after finding those rows I deleted them. To do this I followed the steps below:
I opened the .csv file with Microsoft Excel.
I pressed F5, then in the "Reference" field I wrote "A1:E9030", then clicked OK.
I pressed F5 again, then clicked the "Special..." button, selected "Blanks", then clicked OK.
In the "Home" tab, under "Cells", I clicked "Delete", then "Delete Sheet Rows".
I saved the file and closed it.
This is the file after deleting some rows: https://icedrive.net/0/cfG1dT6bBr
But when I run the code below, it seems that extra columns were added by deleting those rows.
import pandas as pd
# The file doesn't have any header.
my_file = pd.read_csv(path_to_my_file, header=None)
my_file.head()
print(my_file.shape)
The output:
(9024, 244)
You can also see the difference by opening the files with Notepad (the original post attached screenshots of the .csv before and after deleting the rows).
Before deleting the rows, my_file.shape reported 5 columns; after deleting them, it reports 244 columns.
Question:
How can I remove these rows in Excel (or in some other way) without ending up with this problem?
Note: I can't remove these rows with pandas, because pandas automatically ignores them when reading the file, so I have to do this manually.
Thanks in advance for any help.
I am not familiar with the operation you are carrying out in the first part of your question, but I suggest a different solution. pandas will recognize only np.nan objects as null. So, in this case, we could start by loading the .csv file into pandas and replacing the empty cells with np.nan values:
>>> import pandas as pd
>>> import numpy as np
>>> my_file = pd.read_csv(path_to_my_file, header=None)
>>> my_file = my_file.replace('', np.nan)  # assign the result; with inplace=True, replace returns None
Then, we could ask pandas to drop all the rows containing np.nan:
>>> my_file = my_file.dropna()  # again, assign rather than using inplace=True
This should give you the desired output. I think it is a good habit to work on data frames from your IDE directly. Hope this helped!
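Given that the 244 extra columns are presumably empty (trailing delimiters left behind by Excel's re-save, which is an assumption about the file), dropping all-NaN columns right after reading may fix the shape directly:
import pandas as pd
# Assumption: the spurious columns are entirely empty, so they read
# in as all-NaN and can be dropped wholesale.
my_file = pd.read_csv(path_to_my_file, header=None)
my_file = my_file.dropna(axis=1, how='all')  # drop all-empty columns
my_file = my_file.dropna(axis=0, how='any')  # drop rows with any blanks
print(my_file.shape)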

pd.read_excel does recognize the file but does not actually read it

I've been busy working on some code, and one part of it is importing an Excel file. I've been using the code below. Now, on one PC it works, but on another it does not (I did change the paths, though). Python does recognize the Excel file and does not give an error when loading it, but when I print the table it says:
Empty DataFrame
Columns: []
Index: []
Just to be sure, I checked the filepath which seems to be correct. I also checked the sheetname but that is all good too.
df = pd.read_excel(book_filepath, sheet_name='Potentie_alles')
description = df["#"].map(str)
The second line raises KeyError: '#' (# is the header of the first column of the sheet).
Does anyone know how to fix this?
Kind regards,
iCookieMonster
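A first diagnostic step (a sketch, assuming nothing beyond the pandas API; book_filepath is the variable from the question) is to check which sheets pandas actually sees and where the data starts:
import pandas as pd
# List the sheet names pandas finds in the workbook.
xls = pd.ExcelFile(book_filepath)
print(xls.sheet_names)  # is 'Potentie_alles' spelled exactly like this?
# Read with no header inference to see the raw cell layout.
raw = pd.read_excel(xls, sheet_name='Potentie_alles', header=None)
print(raw.head())  # which row actually holds the '#' header?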

How to correct a DataFrame header containing whitespace

I am new to Python and actually started with R.
My problem is that I am unable to debug KeyErrors from my pandas DataFrames. Here is part of the code:
I read in a DataFrame from Excel with the following commands:
import os
import pandas as pd
cwd = os.getcwd()
os.chdir(directorytofile)
os.listdir('.')
file = dataset
xl = pd.ExcelFile(file)
df1 = xl.parse('Sheet1')  # was cl.parse('Sheet1'): 'cl' is undefined; the ExcelFile object is xl
Now when I want to select a header with a blank space in it, like
Lieferung angelegt am
(it's German, sorry for that; roughly "delivery created on")
I get the KeyError. I tried different ways to remove blank spaces from my headers when building the DataFrame, for example:
sep='\s*,\s*'
But the error still occurs. Is there a way for me to see where the problem happens?
It is obviously about the blank spaces, because headers without them work fine.
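A likely fix for this symptom (an assumption, since the exact header strings aren't shown: a KeyError on a name that looks right often means hidden leading or trailing spaces) is to strip the column names right after parsing:
import pandas as pd
xl = pd.ExcelFile(file)             # 'file' as defined in the question
df1 = xl.parse('Sheet1')
# Strip stray leading/trailing whitespace from every column name.
df1.columns = df1.columns.str.strip()
print(df1.columns.tolist())         # inspect the cleaned headers
col = df1['Lieferung angelegt am']  # the header from the question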
