I am new to Python/Pandas. I am wondering if there is code that can help me fix columns that shift to the right in the .csv we pull out of our systems. One column is filled with user input (containing messy characters such as " and ,), so after loading, the user-input column usually spreads over several columns instead of one, wrongly pushing the other columns to the right as well.
I currently fix this manually in Excel by filtering, deleting, and moving the columns back to their correct places; it takes 20 minutes a day.
Is there code I could try that would clean and arrange the columns correctly, or is the manual fix in Excel the easier option? Thank you!
pandas is altering the columns because it sees 'separators' in the import file.
In Excel, count how many times a comma appears on each line. Using your example above, there should be 3 per line.
My quick and dirty solution would be to replace the last three commas on each line of your file with a character that is almost impossible for a user to type; I typically go for a pipe '|' character.
Then try importing that into pandas, specifying the new delimiter/separator, as in the example below:
import pandas as pd
df = pd.read_csv(filepath, sep="|")
df.head()
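A minimal sketch of the comma-to-pipe replacement described above. It assumes the user-input column is the first column, that exactly three clean columns follow it, and that no field contains a line break. The names repair_line and repair_file, and the file paths, are hypothetical.

```python
def repair_line(line, trailing_columns=3, new_sep="|"):
    """Replace only the last `trailing_columns` commas of a line with `new_sep`.

    Assumes the messy user-input column comes first, so every comma before
    the last `trailing_columns` commas belongs to the user's text.
    """
    # rsplit splits from the right, at most `trailing_columns` times.
    parts = line.rstrip("\r\n").rsplit(",", trailing_columns)
    return new_sep.join(parts)


def repair_file(src_path, dst_path, trailing_columns=3):
    """Write a pipe-separated copy of src_path (both paths are placeholders)."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(repair_line(line, trailing_columns) + "\n")


print(repair_line('He said "hi", then left,2021-01-05,42,open'))
# He said "hi", then left|2021-01-05|42|open
```

The repaired file can then be loaded with pd.read_csv(dst_path, sep="|"). If stray double quotes in the user text still confuse the parser, passing quoting=csv.QUOTE_NONE to read_csv may help.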
You cannot control the layout with CSV, which is a pure data transport format. Fortunately, there are third-party libraries that can work with .xlsx files.
Related
I am facing a small problem while reading columns.
My columns are ["onset1", "onset2", "onset3"], and I want to read the values from Excel. But each DataFrame has different column names, so I need to change the names each time, which is a waste of time.
I am wondering if there is an efficient way to read them instead of writing df["onset1"].iloc[-1], df["onset2"].iloc[-1], ...
(I am thinking of reading by the letters at the top of the sheet, like df["V"].iloc[-1], df["W"].iloc[-1].)
I have a very simple task: I need to take the sum of one column in a file that has many columns and thousands of rows. However, every time I open the file in Jupyter, it crashes, since I cannot go over 100 MB per file.
Is there any workaround for such a task? I feel I shouldn't have to open the entire file, since I need just one column.
Thanks!
I'm not sure if this will work since the information you have provided is somewhat limited, but if you're using python 3 I had a similar issue. Try typing this at the top and see if this helps. It might fix your issue.
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
The above solution is sort of a band-aid and isn't supported and may cause undefined behavior. If your data is too big for your memory try reading in the data with dask.
import dask.dataframe as dd
df = dd.read_csv(path, **params)  # params: a dict of keyword arguments, as for pandas.read_csv
You have to open the file even if you want just one row; opening it loads it into memory, and that is your problem.
You can either open the file outside IPython and split it into smaller files, or
use a library like pandas and read it in chunks, as in the answer.
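The chunked-reading option could look roughly like this. It is only a sketch, assuming the file is a CSV with a header row; sum_column, the file path, and the column name are hypothetical placeholders.

```python
import pandas as pd


def sum_column(path, column, chunksize=100_000):
    """Sum one column of a large CSV without loading the whole file."""
    total = 0
    # usecols keeps only the needed column; chunksize makes read_csv
    # return an iterator of DataFrames instead of one big DataFrame.
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        total += chunk[column].sum()
    return total
```

It would be called as sum_column("big_file.csv", "amount"). Only one chunk of one column is held in memory at a time, so the chunk size can be lowered if memory is still tight.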
You should slice the rows into several separate data frames and then work on each data frame in turn.
The hanging is caused by insufficient RAM on your system.
Use new_dataframe = dataframe.iloc[:, :] or new_dataframe = dataframe.loc[:, :] for slicing in pandas.
The row slice goes before the comma and the column slice after it.
Does anyone know how I can insert a dataframe into an Excel file at a desired position?
For example, I would like my dataframe to start at cell "V78".
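There are startrow and startcol arguments in the .to_excel() method. Both are zero-based, so cell V78 (row 78, column V, the 22nd column) corresponds to startrow=77 and startcol=21:
df.to_excel('excel.xlsx', startrow=77, startcol=21)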
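Since startrow and startcol are zero-based integers, a small helper can translate an A1-style cell reference into them. This is a sketch; cell_to_indices is a hypothetical name, not part of pandas.

```python
import re


def cell_to_indices(cell):
    """Convert an A1-style reference such as "V78" to zero-based (row, col)."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", cell)
    if match is None:
        raise ValueError(f"not a valid cell reference: {cell!r}")
    letters, digits = match.groups()
    col = 0
    for ch in letters.upper():
        # Column letters form a base-26 number with digits A=1 ... Z=26.
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(digits) - 1, col - 1


startrow, startcol = cell_to_indices("V78")
print(startrow, startcol)  # 77 21
```

The result can be passed straight on, for example df.to_excel("out.xlsx", startrow=startrow, startcol=startcol). With the default index=True, the index column lands in the start column and the data begins one column to the right.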
I have a solution which may or may not fit your requirements.
I would not import it directly into an existing Excel file, which may contain valuable data; furthermore, keeping the files separate may be of use one day.
You could simply save the dataframe as an Excel file:
df.to_excel('df.xlsx')
Then, in the Excel file that you want to insert it into, create an object of type file and link the two that way.
Personally, I think keeping them separate is better, as once two files become one there is no going back. You could also have multiple files this way for easy comparisons, without fiddling with row/column numbers!
Hope this was of some help!
I have a historic csv that needs to be updated daily (concatenated) with a freshly pulled csv. The issue is that the new csv may have a different number of columns from the historic one. If both files were small, I could just read them in and concatenate with pandas. If the number of columns were the same, I could use cat from the command line. Unfortunately, neither is true.
So, I am wondering if there is a way to do out-of-memory concatenation/join with pandas for something like above, or using one of the command line tools.
Thanks!
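One possible approach, sketched with the standard library csv module rather than pandas: read only the header rows to build the union of column names, then stream every row into a new file, so that neither file is ever fully in memory. It assumes both files have a header row, that columns are matched by name, and that no row has more fields than its header. The function names are hypothetical.

```python
import csv


def read_header(path):
    """Return the header row of a CSV file without reading the data rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))


def concat_csvs(paths, out_path):
    """Stream several CSV files into one, aligning columns by name."""
    # Union of all column names, in order of first appearance.
    columns = []
    for path in paths:
        for name in read_header(path):
            if name not in columns:
                columns.append(name)

    with open(out_path, "w", newline="", encoding="utf-8") as out:
        # restval fills the columns that a given input file does not have.
        writer = csv.DictWriter(out, fieldnames=columns, restval="")
        writer.writeheader()
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    writer.writerow(row)
    return columns
```

For example, concat_csvs(["historic.csv", "new.csv"], "combined.csv") writes the combined file one row at a time. The historic file is rewritten on every run, which costs time but not memory.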
How would I go about creating a MySQL table schema by inspecting an Excel (or CSV) file?
Are there any ready Python libraries for the task?
Column headers would be sanitized to column names. Datatype would be estimated based on the contents of the spreadsheet column. When done, data would be loaded to the table.
I have an Excel file of ~200 columns that I want to start normalizing.
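The header-sanitizing step could be sketched as follows. The rules used here (lowercase, ASCII letters, digits and underscores only, at most 64 characters, numeric suffixes for duplicates) are assumptions chosen for illustration, and both function names are hypothetical.

```python
import re


def sanitize_header(header, max_length=64):
    """Turn a spreadsheet header into a conservative SQL column name."""
    name = header.strip().lower()
    # Replace every run of characters outside [a-z0-9_] with one underscore.
    name = re.sub(r"[^a-z0-9_]+", "_", name).strip("_")
    if not name:
        name = "column"
    if name[0].isdigit():
        # Avoid names that start with a digit.
        name = "c_" + name
    return name[:max_length]


def sanitize_headers(headers):
    """Sanitize all headers and give duplicates a numeric suffix."""
    seen = {}
    result = []
    for header in headers:
        name = sanitize_header(header)
        count = seen.get(name, 0)
        seen[name] = count + 1
        result.append(name if count == 0 else f"{name}_{count + 1}")
    return result


print(sanitize_headers(["Order ID", "Total ($)", "2019 Sales", "order id"]))
# ['order_id', 'total', 'c_2019_sales', 'order_id_2']
```

Non-ASCII headers lose most of their characters under these rules, so a transliteration step may be needed for such data.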
Use the xlrd module; start here. [Disclaimer: I'm the author]. xlrd classifies cells into text, number, date, boolean, error, blank, and empty. It distinguishes dates from numbers by inspecting the format associated with the cell (e.g. "dd/mm/yyyy" versus "0.00").
The job of writing code that wades through user-entered data to decide which DB datatype to use for each column is not something that can be easily automated. You should be able to eyeball the data, assign types like integer, money, text, date, datetime, time, etc., and write code to check your guesses. Note that you need to be able to cope with things like numeric or date data entered in text fields (which can look OK in the GUI). You need a strategy to handle cells that don't fit the "estimated" datatype. You need to validate and clean your data. Make sure you normalize text strings (strip leading/trailing whitespace, replace multiple whitespace characters with a single space). Excel text is (BMP-only) Unicode; don't bash it into ASCII or "ANSI" -- work in Unicode and encode in UTF-8 to put it in your database.
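The text normalization mentioned above can be sketched in a few lines. normalize_text is a hypothetical helper name; the value stays a Unicode string until it is encoded for the database.

```python
import re


def normalize_text(value):
    """Strip leading/trailing whitespace and collapse inner runs to one space."""
    return re.sub(r"\s+", " ", value).strip()


cleaned = normalize_text("  Zoë \t  Müller\n")
print(cleaned)                   # Zoë Müller
print(cleaned.encode("utf-8"))   # bytes ready for a UTF-8 database column
```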
Quick and dirty workaround with phpmyadmin:
Create a table with the right number of columns. Make sure the data fits the columns.
Import the CSV into the table.
Use the "Propose table structure" feature.
As far as I know, there is no tool that can automate this process (I would love for someone to prove me wrong as I've had this exact problem before).
When I did this, I came up with two options:
(1) Manually create the columns in the db with the appropriate types and then import, or
(2) Write some kind of filter that could "figure out" what data types the columns should be.
I went with the first option mainly because I didn't think I could actually write a program to do the type inference.
If you do decide to write a type inference tool/conversion, here are a couple of issues you may have to deal with:
(1) Excel dates are actually stored as the number of days since December 31st, 1899; how does one infer then that a column is dates as opposed to some piece of numerical data (population for example)?
(2) For text fields, do you just make the columns of type varchar(n) where n is the longest entry in that column, or do you make it an unbounded char field if one of the entries is longer than some upper limit? If so, what's a good upper limit?
(3) How do you automatically convert a float to a decimal with the correct precision and without losing any places?
Obviously, this doesn't mean that you won't be able to (I'm a pretty bad programmer). I hope you do, because it'd be a really useful tool to have.
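For what it is worth, a very rough type-inference filter along the lines of option (2) might start like this. It is a sketch under stated assumptions: the values arrive as text (for example from a CSV), dates are deliberately not detected, and the 255-character VARCHAR cutoff is an arbitrary choice. infer_sql_type is a hypothetical name.

```python
import re

INT_RE = re.compile(r"[+-]?[0-9]+")
DEC_RE = re.compile(r"[+-]?([0-9]+)\.([0-9]+)")


def infer_sql_type(values, max_varchar=255):
    """Guess a MySQL column type from the text values of one column."""
    # Ignore empty cells; they say nothing about the type.
    values = [v.strip() for v in values if v and v.strip()]
    if not values:
        return "TEXT"

    if all(INT_RE.fullmatch(v) for v in values):
        return "BIGINT"

    int_digits = 1
    frac_digits = 0
    for v in values:
        m = DEC_RE.fullmatch(v)
        if m:
            int_digits = max(int_digits, len(m.group(1)))
            frac_digits = max(frac_digits, len(m.group(2)))
        elif INT_RE.fullmatch(v):
            int_digits = max(int_digits, len(v.lstrip("+-")))
        else:
            break  # a non-numeric value: fall through to the text types
    else:
        # Every value is numeric: keep all digits seen on both sides of the point.
        return f"DECIMAL({int_digits + frac_digits},{frac_digits})"

    longest = max(len(v) for v in values)
    return f"VARCHAR({longest})" if longest <= max_varchar else "TEXT"


print(infer_sql_type(["1.50", "120", "3.125"]))  # DECIMAL(6,3)
```

Guesses like these still need checking by eye: a column of ZIP codes with leading zeros, for instance, would be classified as BIGINT and lose its zeros.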
Just for (my) reference, I documented below what I did:
XLRD is practical, however I've just saved the Excel data as CSV, so I can use LOAD DATA INFILE
I've copied the header row and started writing the import and normalization script
Script does: CREATE TABLE with all columns as TEXT, except for Primary key
Query MySQL: LOAD DATA LOCAL INFILE, loading all CSV data into TEXT fields.
Based on the output of PROCEDURE ANALYSE, I was able to ALTER TABLE to give columns the right types and lengths. PROCEDURE ANALYSE returns ENUM for any column with few distinct values, which is not what I needed, but I found that useful later for normalization. Eye-balling 200 columns was a breeze with PROCEDURE ANALYSE. Output from PhpMyAdmin propose table structure was junk.
I wrote some normalization, mostly using SELECT DISTINCT on columns and INSERTing the results into separate tables. I first added a column for the FK to the old table. Just after each INSERT, I got its ID and UPDATEd the FK column. When the loop finished, I dropped the old column, leaving only the FK column. I did the same with multiple dependent columns. It was much faster than I expected.
I ran (django) python manage.py inspectdb, copied the output to models.py and added all those ForeignKey fields, as FKs do not exist on MyISAM. Wrote a little Python views.py, urls.py, a few templates... TADA
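The first two steps of the script (CREATE TABLE with every column as TEXT, then LOAD DATA LOCAL INFILE) could be generated from the header row roughly as below. This is a sketch, not the original script: build_import_sql, the id column definition, and the comma/double-quote/newline CSV settings are assumptions, and the names are interpolated without escaping, so they must already be sanitized.

```python
def build_import_sql(table, columns, csv_path):
    """Build the staging statements: every column TEXT, plus a surrogate key."""
    # Names are interpolated as-is; they are assumed to be sanitized already.
    column_defs = ",\n  ".join(f"`{name}` TEXT" for name in columns)
    create = (
        f"CREATE TABLE `{table}` (\n"
        "  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,\n"
        f"  {column_defs}\n"
        ")"
    )
    column_list = ", ".join(f"`{name}`" for name in columns)
    load = (
        f"LOAD DATA LOCAL INFILE '{csv_path}'\n"
        f"INTO TABLE `{table}`\n"
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n"
        "LINES TERMINATED BY '\\n'\n"
        "IGNORE 1 LINES\n"  # skip the header row
        f"({column_list})"
    )
    return create, load


create_sql, load_sql = build_import_sql("staging", ["name", "city"], "data.csv")
print(create_sql)
print(load_sql)
```

Note that PROCEDURE ANALYSE was removed in MySQL 8.0, so the type-suggestion step described above only applies to older servers.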
Pandas can return a schema:
pandas.read_csv('data.csv').dtypes
References:
pandas.read_csv
pandas.DataFrame