What's the best way to clean a CSV and load it into MySQL with Python?

I am working on loading a couple of different CSVs into a MySQL database, but the CSVs have some anomalies.
Note: I am using pandas read_csv to load each file into a dataframe and to_sql to load it into MySQL.
I am trying to remove characters like '$' and ',' from the CSV. After getting the data into a dataframe with pd.read_csv, I try df[col].replace('$', '') within the dataframe, but it does not work on some values and I am unable to find out why. There is no error as such; it simply does not remove these characters.
The intention is to remove these special characters so that accurate data types can be inferred using the SQLAlchemy approach below.
for col in df.columns:
    df[col] = df[col].replace('$', '')
    df[col] = df[col].replace(',', '')
For finding the data types I am using SQLAlchemy, as in "pandas to_sql all columns as nvarchar".
Please let me know.

For a string column, you should use the .str accessor.
Try:
df[col] = df[col].str.replace('$', '')
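For reference, here is a minimal sketch of cleaning every column and then loading the result with to_sql. The file name, connection string, and table name are placeholder assumptions; passing regex=False keeps the '$' pattern literal across pandas versions.

import pandas as pd
from sqlalchemy import create_engine

# read everything as strings first so .str works on every column
df = pd.read_csv("data.csv", dtype=str)

for col in df.columns:
    # strip currency symbols and thousands separators; regex=False keeps '$' literal
    df[col] = df[col].str.replace('$', '', regex=False)
    df[col] = df[col].str.replace(',', '', regex=False)

# re-infer numeric columns after cleaning, leaving genuine text columns untouched
df = df.apply(pd.to_numeric, errors='ignore')

engine = create_engine("mysql+pymysql://user:password@host/dbname")  # placeholder credentials
df.to_sql("my_table", con=engine, if_exists="replace", index=False)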

Related

Pyspark external table compression does not work

I am trying to save an external table from PySpark in Parquet format, and I need to compress it. The PySpark version I am using is 2.4.7. I am updating the table after the initial creation, appending data in a loop.
So far I have set the following options:
.config("spark.sql.parquet.compression.codec", "snappy") df.write.mode("append").format("parquet").option("compression","snappy").saveAsTable(...) df.write.mode("overwrite").format("parquet").option("compression","snappy").saveAsTable(...)
Is there anything else that I need to set or am I doing something wrong?
Thank you
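For reference, here is how those settings would typically fit together in one script. This is only a sketch assembling the options from the question (the app name, input path, and table name are placeholders), not a confirmed fix for the compression issue.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-compression-example")                   # placeholder app name
    .config("spark.sql.parquet.compression.codec", "snappy")  # session-level codec
    .getOrCreate()
)

df = spark.read.csv("input/*.csv", header=True)               # placeholder source data

(df.write
    .mode("append")
    .format("parquet")
    .option("compression", "snappy")                          # writer-level codec
    .saveAsTable("my_database.my_external_table"))            # placeholder table name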

Pandas dataframes won't equal each other due to trailing white space being removed

Context
I have two Pandas dataframes (df1 and df2). One is created using the read_csv() method and the other using the read_sql() method (n.b. I'm using MySQL and SQLAlchemy Core).
The MySQL table contains exactly the same records as the CSV file, yet whenever I test for equality between them as below, I keep getting False values.
df1 == df2
Upon inspecting the Pandas dataframe comparison report, the MySQL table, and the CSV file, I noticed that record lengths vary between the CSV file and the MySQL database due to differences in trailing whitespace.
So the question is:
How do I get MySQL to retain trailing whitespace?
I have tried changing the column data types to varchar(), but that does not seem to work. Perhaps it has to do with the configuration of MySQL...
Anyway, I've spent a day on this problem, so I thought I'd reach out.
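For what it's worth, one way to confirm that trailing whitespace is the only difference is to normalize it on both sides before comparing. This is a diagnostic sketch rather than a MySQL-side fix, and it assumes the string columns have object dtype.

def strip_trailing(df):
    # right-strip every string column so the comparison ignores trailing whitespace
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].str.rstrip()
    return out

print(strip_trailing(df1).equals(strip_trailing(df2)))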

What is the most efficient way to concatenate thousands of dataframes in Python?

I have scraped some data from a website and stored it in one .csv file per product. Since it is a quite popular website, I ended up with more than 30,000 CSV files that I need to merge into one. I'm not really an expert in pandas, but my first reaction was to rely on the concat() function. That is, my code looks like this:
df = pd.DataFrame(columns=["product_id", "price"])
for file in onlyfiles:
    df1 = pd.read_csv(file)
    df = pd.concat([df, df1])
where onlyfiles is the list of files in the directory in which all my CSVs are stored. It works, but it starts to slow down as the number of dataframes increases, and it is obviously not the most efficient way to achieve this goal. Does anybody have an idea of a more efficient method to use here?
Thank you for your help.
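As a side note, a common way to avoid the cost of repeatedly concatenating inside the loop is to collect the frames in a list and call concat() once at the end. A minimal sketch, assuming onlyfiles is a list of CSV paths:

import pandas as pd

frames = [pd.read_csv(file) for file in onlyfiles]  # read each CSV once
df = pd.concat(frames, ignore_index=True)           # single concatenation at the end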
You need to start storing your data in an SQL database; CSV files are not databases.
You might want to look into PostgreSQL, as SQLite may not have all of the features you need. You should be able to set up SQL code that dumps data into a single database from a CSV file. I have an automated process that regularly pulls CSV data into a database.
You can interact with Postgres from Python with the psycopg2 library. Another thing you may want to consider is pandasql, which allows you to manipulate your Pandas dataframes with SQL code. I always import pandasql when working with Pandas dataframes.
Here is an example of my Postgres CSV file data import:
--Data Import Query
COPY stock_data(date, ticker, industry, open, high, low, close, adj_close, volume, dor)
FROM 'C:\Users\storageplace\Desktop\username\company_data\stock_data\stockdata.csv'
DELIMITER ','
CSV HEADER;
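If you prefer to drive the same kind of COPY from Python, here is a minimal sketch using psycopg2; the connection details, table name, and file path are placeholders.

import psycopg2

# placeholder connection details
conn = psycopg2.connect(host="localhost", dbname="mydb", user="user", password="password")

with conn, conn.cursor() as cur, open("stockdata.csv", "r") as f:
    # COPY ... FROM STDIN streams the local file through the client connection
    cur.copy_expert("COPY stock_data FROM STDIN WITH (FORMAT csv, HEADER true)", f)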

How do I get a DASK dataframe into a MySQL datatable?

I have fetched data from a CSV file, and it is held and manipulated in my Dask dataframe. From there I need to write the data into a data table. I have not really come across any solutions for this. Pandas has built-in functionality for this with its to_sql function, so I am unsure whether I need to convert to Pandas first. I currently think that converting the Dask dataframe to Pandas will cause it to be loaded fully into memory, which may defeat the purpose of using Dask in the first place.
What would the best and fastest approach be to write a Dask dataframe to a datatable?
Assuming you have your Dask dataframe as df, you just need to do this:
df.to_sql(table, schema=schema, uri=conn_str, if_exists="append", index=False)
I've found this is easily the quickest method for Dask dataframes.
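For context, a hedged usage sketch; the CSV pattern, table name, and connection URI are placeholders, and note that Dask's to_sql takes a connection URI string rather than a live engine.

import dask.dataframe as dd

ddf = dd.read_csv("data/2014-*.csv")                     # placeholder input files
conn_str = "mysql+pymysql://user:password@host/dbname"   # placeholder connection URI

ddf.to_sql("my_table", uri=conn_str, if_exists="append", index=False)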
I have no problem with @kfk's answer, as I also investigated that, but my solution was as follows.
I dump the Dask dataframe to a CSV, and from there pick the CSV up with a Golang application that pushes the data into Mongo using multi-threading. For 4.5 million rows, the time went from 38 minutes using "load local infile" to 2 minutes using the multi-threaded app.
pandas.to_sql() is not the fastest way to load data into a database. to_sql() uses an ODBC driver connection, which is a lot slower than the built-in bulk load method.
You can load data from a csv file in MySQL like this:
LOAD DATA INFILE 'some_file.csv'
INTO TABLE some_mysql_table
FIELDS TERMINATED BY ';'
So what I would do is this:
import dask.dataframe as dd
from sqlalchemy import create_engine

# 1) write the Dask dataframe out as a single CSV file
df = dd.read_csv('2014-*.csv')
df.to_csv("some_file.csv", single_file=True, index=False)

# 2) bulk-load the file; the terminator must match the delimiter to_csv wrote,
#    and IGNORE 1 LINES skips the header row
sql = """LOAD DATA INFILE 'some_file.csv'
INTO TABLE some_mysql_table
FIELDS TERMINATED BY ','
IGNORE 1 LINES"""

engine = create_engine("mysql://user:password@server/dbname")
engine.execute(sql)
You can easily wrap the above into a function and use it instead of to_sql.
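A hedged sketch of what that wrapper might look like; the connection URI, table name, and temporary file name are placeholder assumptions.

import dask.dataframe as dd
from sqlalchemy import create_engine, text

def bulk_load(ddf, table, conn_uri, csv_path="bulk_tmp.csv"):
    # write a single CSV, then let MySQL bulk-load it
    ddf.to_csv(csv_path, single_file=True, index=False)
    sql = f"""LOAD DATA INFILE '{csv_path}'
        INTO TABLE {table}
        FIELDS TERMINATED BY ','
        IGNORE 1 LINES"""
    engine = create_engine(conn_uri)
    with engine.begin() as conn:   # commits on successful exit
        conn.execute(text(sql))

# usage (all placeholders):
# bulk_load(dd.read_csv('2014-*.csv'), 'some_mysql_table', 'mysql://user:password@server/dbname')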

Insert pandas dataframes to SQL

I have 10,000 dataframes (which can all be transformed into JSON). Each dataframe has 5,000 rows, so eventually it's quite a lot of data that I would like to insert into my AWS RDS databases.
I want to insert them into my databases, but I find the process using PyMySQL a bit too slow because I iterate through every single row and insert it.
First question: is there a way to insert a whole dataframe into a table straight away? I've tried using the to_sql function from the dataframe library, but it doesn't seem to work as I am using Python 3.6.
Second question: should I use NoSQL instead of RDS? What would be the best way to structure my (big) data?
Many thanks
from sqlalchemy import create_engine
engine = create_engine("mysql://......rds.amazonaws.com")
con = engine.connect()
my_df.to_sql(name='Scores', con=con, if_exists='append')
The table "Scores" is already existing and I would like to put all of my databases into this specific table. Or is there a better way to organise my data?
It seems like you're either missing the package or the package is installed in a different directory. Use a file manager to look for the missing library libmysqlclient.21.dylib and then copy it next to /Users/anaconda3/lib/python3.6/site-packages/MySQLdb/_mysql.cpython-36m-darwin.so.
My best guess is that it's in either your lib or MySQLdb directory. You may also be able to find it in a virtual environment that you have set up.
