I want to read data from my Oracle database. I use pandas' read_sql and set the parameter chunksize=20000:
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("my oracle")
df = pd.read_sql("select clause",engine,chunksize=20000)
It returns an iterator, and I tried to convert this generator to a dataframe using df = pd.DataFrame(df), but that is wrong. How can the iterator be converted to a dataframe?
The chunks from this iterator can be concatenated, which returns a single dataframe:
df = pd.concat(df)
See the pandas.concat documentation.
If you can't use concat directly, try the following:
gens = pd.read_sql("select clause", engine, chunksize=20000)
dflist = []
for gen in gens:
    dflist.append(gen)
df = pd.concat(dflist)
I am using the pyreadstat library to read SAS dataset files (*.sas7bdat, *.xpt).
import pyreadstat as pd
import pandas as pda
import sys
import json
FILE_LOC = sys.argv[1]
PAGE_SIZE = 100
PAGE_NO = int(sys.argv[2])-1
START_FROM_ROW = (PAGE_NO * PAGE_SIZE)
pda.set_option('display.max_columns',None)
pda.set_option('display.width',None)
pda.set_option('display.max_rows',None)
df = pd.read_sas7bdat(FILE_LOC, row_offset=START_FROM_ROW, row_limit=PAGE_SIZE, output_format='dict')
finalList = []
for key in df[0]:
    l = list(map(lambda x: str(x) if str(x) == "nan" else x, df[0][key].tolist()))
    nparray = {key: l}
    finalList.append(nparray)
return json.dumps(finalList)  # this return lives inside a function in my script
How can I perform sorting using the pyreadstat library?
Unfortunately Pyreadstat cannot return sorted data. You need to read the sas7bdat file data into memory and then you can sort it.
In order to sort, take into account that pyreadstat returns a tuple of a pandas dataframe and a metadata object. Once you have the dataframe, you can sort it by one or multiple columns using the sort_values method. Therefore it is better to get a dataframe rather than a dictionary in this case.
df, meta = pd.read_sas7bdat(FILE_LOC, row_offset=START_FROM_ROW, row_limit=PAGE_SIZE)
# sort
df_sorted = df.sort_values(["columnA", "columnB"])
# replace nans
df_sorted = df_sorted.fillna("nan")
# you can directly convert to json
# check the options in the documentation, it may give you something different than what you want
out = df_sorted.to_json()
# otherwise transform to dict and build your json as before
out = df_sorted.to_dict(orient='list')
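For example, to reproduce the paginated, per-column JSON from the question with the sorted data, a rough sketch (reusing the question's FILE_LOC, START_FROM_ROW and PAGE_SIZE variables, and the placeholder columnA/columnB sort columns) could be:
import json
import pyreadstat as pd

df, meta = pd.read_sas7bdat(FILE_LOC, row_offset=START_FROM_ROW, row_limit=PAGE_SIZE)
df_sorted = df.sort_values(["columnA", "columnB"]).fillna("nan")
# one {column: values} dict per column, matching the structure built in the question
finalList = [{col: vals} for col, vals in df_sorted.to_dict(orient='list').items()]
out = json.dumps(finalList, default=str)  # default=str covers dates and other non-JSON types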
In the following code I'm trying to read multiple sheets from an Excel file, remove the rows with empty cells, group the columns, and store the result in another Excel file:
import pandas as pd
sheets = ['R9_14062021','LOGS R9','LOGS R7 01032021']
df = pd.read_excel('LOGS.xlsx',sheet_name = sheets )
df.dropna(inplace = True)
df['Dt'] = pd.to_datetime(df['Dt']).dt.date
df1 = df.groupby(['Dt','webApp','mw'])[['chgtCh','accessRecordModule','playerPlay startOver',
                                        'playerPlay PdL','playerPlay PVR','contentHasAds',
                                        'pdlComplete','lirePdl','lireVod']].sum()
df1.reset_index(inplace=True)
df1.to_excel(r'logs1.xlsx', index = False)
When I execute my script I get the following error:
AttributeError: 'dict' object has no attribute 'dropna'
How can I fix it?
When you provide a list of sheets for the sheet_name param, the return object is a dict of DataFrames, as described here.
dropna is a method of DataFrame, so you have to select a sheet first. For example:
df['R9_14062021'].dropna(inplace=True)
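If you want to apply the change to every sheet, a minimal sketch (assuming the same LOGS.xlsx file and sheet list from the question) is to loop over the dict:
import pandas as pd

sheets = ['R9_14062021', 'LOGS R9', 'LOGS R7 01032021']
df = pd.read_excel('LOGS.xlsx', sheet_name=sheets)  # dict: sheet name -> DataFrame
for name, sheet_df in df.items():
    sheet_df.dropna(inplace=True)  # call dropna on each DataFrame, not on the dict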
Taken from pandas documentation for pd.read_excel:
If you give sheet_name a list, you will receive a dict of dataframes keyed by sheet name.
This means you'll have to go over each dataframe and call dropna() separately, because you can't dropna() on a dictionary. Your code will look like this:
import pandas as pd
sheets = ['R9_14062021','LOGS R9','LOGS R7 01032021']
dfs_list = pd.read_excel('LOGS.xlsx',sheet_name = sheets )
for i in dfs_list:
    df = dfs_list[i]
    df.dropna(inplace=True)
    df['Dt'] = pd.to_datetime(df['Dt']).dt.date
    df1 = df.groupby(['Dt','webApp','mw'])[['chgtCh','accessRecordModule','playerPlay startOver',
                                            'playerPlay PdL','playerPlay PVR','contentHasAds',
                                            'pdlComplete','lirePdl','lireVod']].sum()
    df1.reset_index(inplace=True)
    # note: this overwrites logs1.xlsx on every iteration, so only the last sheet's result is kept
    df1.to_excel(r'logs1.xlsx', index=False)
The main difference here is the usage of
for i in dfs_list:
    df = dfs_list[i]
in order to apply each change you are making to each dataframe. If you want a specific dataframe you should do, for example: dfs_list['R9_14062021'].dropna().
Hope this helps and this is what you were aiming for.
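If the goal is to keep the grouped result of every sheet rather than overwriting logs1.xlsx, one option is pd.ExcelWriter with one output sheet per input sheet. This is only a sketch, not part of the original answer; it simply reuses the input sheet names for the output:
import pandas as pd

sheets = ['R9_14062021', 'LOGS R9', 'LOGS R7 01032021']
dfs_list = pd.read_excel('LOGS.xlsx', sheet_name=sheets)
cols = ['chgtCh', 'accessRecordModule', 'playerPlay startOver', 'playerPlay PdL',
        'playerPlay PVR', 'contentHasAds', 'pdlComplete', 'lirePdl', 'lireVod']

with pd.ExcelWriter('logs1.xlsx') as writer:
    for name, df in dfs_list.items():
        df = df.dropna()
        df['Dt'] = pd.to_datetime(df['Dt']).dt.date
        df1 = df.groupby(['Dt', 'webApp', 'mw'])[cols].sum().reset_index()
        # write each sheet's grouped result to its own sheet in the output workbook
        df1.to_excel(writer, sheet_name=name, index=False)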
I have a CSV file of around 40K rows, and I want to delete 10K rows based on a condition (e.g. user_name = Max). My data looks like this:
user1_name,user2_name,distance
"Unews","CCSSuptConnelly",""
"Unews","GwapTeamFre",""
"Unews","WilsonRecDept","996.27"
"Unews","ChiOmega_ISU","1025.03"
"Unews","officialtshay",""
"Unews","hari",""
"Unews","lashaunlester7",""
"Unews","JakeSlaughter5","509.53"
Thank you!
import pandas as pd
Read the CSV:
df = pd.read_csv('filename')
Create an index of the rows that match the condition:
index_names = df[df['user2_name'] == 'Max'].index
Drop those rows:
df.drop(index_names, inplace=True)
You can use the pandas library for this kind of problem and then use the .loc[] indexer. Link to the docs: Loc Function in pandas
import pandas as pd
df = pd.read_csv('name.csv')
df_filtered = df.loc[~(df['user_name'] == 'Max'), :]
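A simpler equivalent is plain boolean indexing (here I use the user2_name column from the sample data; swap in whichever column actually holds the name):
import pandas as pd

df = pd.read_csv('name.csv')
df_filtered = df[df['user2_name'] != 'Max']  # keep every row where the name is not 'Max'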
The code below is from the vaex documentation:
pandas_df = pd.read_sql_query('SELECT * FROM MYTABLE', con=engine)
df = vaex.from_pandas(pandas_df, copy_index=False)
Description
I have more data than RAM.
But when I use the above code, it tries to pull all the data into a pandas dataframe.
So to solve this I used the chunksize parameter, which gives a generator.
But converting the generator back to a pandas dataframe again needs memory.
Below is the code I tried.
import vaex
import pandas as pd

df = pd.read_sql_query('select * from "user"."table"', conn, chunksize=1000000)
chunk_list = []
for i in df:
    chunk_list.append(i)
data = pd.concat(chunk_list)
df2 = vaex.from_pandas(data)
alldat = df2.concat(df2)
Please help me with this issue.
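One way to avoid materialising the whole result in pandas is to export each chunk to its own file on disk and let vaex open them lazily. This is only a sketch: the chunk_*.hdf5 file names are made up, and conn is the same connection object as in the code above.
import pandas as pd
import vaex

chunks = pd.read_sql_query('select * from "user"."table"', conn, chunksize=1000000)
for i, chunk in enumerate(chunks):
    # convert each pandas chunk and write it to disk, so RAM only holds one chunk at a time
    vaex.from_pandas(chunk, copy_index=False).export_hdf5('chunk_%d.hdf5' % i)

# vaex memory-maps the HDF5 files instead of loading them; a glob opens them as one dataframe
df = vaex.open('chunk_*.hdf5')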
Unexpected error: <class 'NameError'>
My code creates a table and adds records from a CSV. I'm not sure how to rectify the error; should I use the same variable names as in the CSV?
I've never had good luck using the MySQL connector directly either. Now that it's installed, try a combo of sqlalchemy and pandas. Pandas can do the table creation for you, and it will trim your code a lot.
import sqlalchemy
import pandas as pd
# MySQL database connection
engine_stmt = 'mysql+mysqldb://%s:%s@%s:3306/%s' % (username, password,
                                                    server, database)
engine = sqlalchemy.create_engine(engine_stmt)
# get your data into pandas
df = pd.read_csv("file/location/name.csv")
# adjust your dataframe as you want it to look in the database
df = df.rename(columns={0: 'yearquarter', 1: 'sms_volumes'})
# using your existing function to assign start/end row by row
for index, row in df.iterrows():
    dt_start, dt_end = getdatesfromquarteryear(row['yearquarter'])
    df.loc[index, 'sms_start_date'] = dt_start
    df.loc[index, 'sms_end_date'] = dt_end
# write the entire dataframe to database
df.to_sql(name='sms_volumes', con=engine,
if_exists='append', index=False, chunksize=1000)
print('All data inserted!')
Pandas also makes it easy to get the data from your table back into a dataframe, similar to read_csv():
# create a new dataframe from your existing table
new_df = pd.read_sql("SELECT * FROM sms_volumes", engine)