I have a pandas dataframe which looks like this:
Name Age
0 tom 10
1 nick 15
2 juli 14
I am trying to iterate over each name, connect to a MySQL database, match the name against a column in the database, fetch the id for that name, and replace the name with that id in the above data frame. The desired output is as follows:
Name Age
0 1 10
1 2 15
2 4 14
The following is the code that I have tried:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def@localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)
for index, rows in df.iterrows():
    cquery = "select id from students where studentsName=" + '"' + rows['Name'] + '"'
    sid = pd.read_sql(cquery, con=engine)
    df['Name'] = sid['id'].iloc[0]
    print(df[['Name', 'Age']])
The above code prints the following output:
Name Age
0 1 10
1 1 15
2 1 14
Name Age
0 2 10
1 2 15
2 2 14
Name Age
0 4 10
1 4 15
2 4 14
I understand that it iterates through the entire table for each matched name and prints it each time. How do I get each value replaced only once?
Slight rewrite of your code: if you want to do a general transformation on a dataframe, this is a better way to go about it.
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def@localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
def replace_name(name: str) -> int:
    cquery = "select id from students where studentsName='{}'".format(name)
    sid = pd.read_sql(cquery, con=engine)
    return sid['id'].iloc[0]

df['Name'] = df['Name'].apply(replace_name)
This should perform the transformation you're looking for.
The problem in your code as written is the line:
df['Name'] = sid['id'].iloc[0]
This sets every value in the Name column to the first id entry in your query result.
To accomplish what you want, you want something like:
df.loc[index, 'Name'] = sid['id'].iloc[0]
This will set the value at index location index in the 'Name' column to the first id entry in your query result.
This will accomplish what you want to do, and you can stop reading here if you're in a hurry. If you're not in a hurry, and you'd like to become wiser, I encourage you to read on.
It is generally a mistake to loop over the rows in a dataframe. It's also generally a mistake to iterate through a list carrying out a single query on each item in the list. Both of these are slow and error-prone.
A more idiomatic (and faster) way of doing this would be to get all the relevant rows from the database in one query, merge them with your current dataframe, and then drop the column you no longer want. Something like the following:
names = ','.join(f"'{name}'" for name in df['Name'])
query = f"select id, studentsName as Name from students where studentsName in ({names})"
student_ids = pd.read_sql(query, con=engine)
df_2 = df.merge(student_ids, on='Name', how='left')
df_with_ids = df_2[['id', 'Age']]
One query executed, no loops to worry about. Let the database engine and Pandas do the work for you.
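One caveat worth adding (not part of the original answer): interpolating names straight into the SQL string breaks on quotes and invites SQL injection. A minimal sketch of a safer variant using SQLAlchemy bound parameters, assuming the same engine and students table as above; exact parameter passing can vary a little between pandas/SQLAlchemy versions:
from sqlalchemy import text, bindparam

# expanding=True lets a single :names placeholder accept a whole list,
# and the driver handles quoting and escaping for us
query = text(
    "select id, studentsName as Name from students "
    "where studentsName in :names"
).bindparams(bindparam("names", expanding=True))

student_ids = pd.read_sql(query, con=engine, params={"names": df["Name"].tolist()})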
You can do this kind of operation the following way; please follow the comments and feel free to ask questions:
import pandas as pd
# create frame
x = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "age": [1, 2, 3]
    }
)
# create some kind of db
mock_database = {"A": 10, "B": 20, "C": 30}
x["id"] = None # add empty column
print(x)
# change values in the new column
for i in range(len(x["name"])):
    # use .loc: chained assignment like x["id"][i] = ... no longer updates the frame in recent pandas
    x.loc[i, "id"] = mock_database.get(x.loc[i, "name"])
print("*" * 100)
print(x)
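A vectorized alternative to the loop (my addition, but standard pandas): Series.map performs the whole dict lookup in one pass.
# same result without an explicit loop
x["id"] = x["name"].map(mock_database)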
A good way to do that would be:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def@localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)
name_ids = []
for student_name in df['Name']:
    cquery = "select id from students where studentsName='{}'".format(student_name)
    sid = pd.read_sql(cquery, con=engine)
    # append the scalar id, not the whole result frame
    name_ids.append(sid['id'].iloc[0] if not sid.empty else None)

# DEBUGGED WITH name_ids = [1,2,3]
df['Name'] = name_ids
print(df)
I checked it with an example list of ids and it works, so if the query format is correct this will work. Performance-wise I could not think of a better solution, since you have to run one query per student, but there is probably a way to fetch all the ids with fewer queries.
Say I have a table that looks something like:
+----------------------------------+-------------------------------------+----------------------------------+
| ExperienceModifier|ApplicationId | ExperienceModifier|RatingModifierId | ExperienceModifier|ActionResults |
+----------------------------------+-------------------------------------+----------------------------------+
| | | |
+----------------------------------+-------------------------------------+----------------------------------+
I would like to grab all of the columns that lead with 'ExperienceModifier' and stuff the results of that into its own dataframe. How would I accomplish this with pandas?
You can try pandas.DataFrame.filter:
df.filter(like='ExperienceModifier')
If you want only the columns that contain ExperienceModifier at the beginning, anchor the regex instead:
df.filter(regex='^ExperienceModifier')
Ynjxsjmh's answer will get all columns that contain "ExperienceModifier". If you literally want columns that start with that string, rather than merely contain it, you can do:
new_df = df[[col for col in df.columns if col[:18] == 'ExperienceModifier']]
If all of the desired columns have | after "ExperienceModifier", you could also do:
new_df = df[[col for col in df.columns if col.split('|')[0] == 'ExperienceModifier']]
All of these will create a view of the dataframe. If you want a completely separate dataframe, you should copy it, like this:
new_df = df[[col for col in df.columns if col.split('|')[0] == 'ExperienceModifier']].copy()
You also might want to create a multi-index by splitting the column names on | rather than creating a separate dataframe, as sketched below.
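A minimal sketch of that multi-index idea, assuming the column names from the question:
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['ExperienceModifier|ApplicationId',
                                        'ExperienceModifier|RatingModifierId',
                                        'ExperienceModifier|ActionResults'])

# split each name on '|' into two index levels, then select the whole group at once
df.columns = pd.MultiIndex.from_tuples([tuple(col.split('|')) for col in df.columns])
experience = df['ExperienceModifier']  # every column under that prefix
print(experience)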
The accepted answer does the job easily, but I still attach my hand-made version, which works:
import pandas as pd

lst = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
column_names = ['ExperienceModifier|ApplicationId', 'ExperienceModifier|RatingModifierId',
                'ExperienceModifier|ActionResults', 'OtherName|ActionResults']
data = pd.DataFrame(lst, columns=column_names)
print(data)

old_and_dropped_dataframes = []
new_dataframes = []
for i in range(len(column_names)):
    # split each column name on '|' and check its leading part
    splits = column_names[i].split('|')
    if 'ExperienceModifier' in splits:
        new_dataframes.append(data.iloc[:, [i]])
    else:
        old_and_dropped_dataframes.append(data.iloc[:, [i]])

ExperienceModifier_dataframe = pd.concat(new_dataframes, axis=1)
print(ExperienceModifier_dataframe)
OtherNames_dataframe = pd.concat(old_and_dropped_dataframes, axis=1)
print(OtherNames_dataframe)
This script creates two new dataframes from the initial dataframe: one containing the columns whose names start with ExperienceModifier, and another containing the columns that do not.
How do I generate a random 20-digit UID (unique ID) in Python? I want to generate a UID for each row in my data frame. It should be exactly 20 digits and must be unique.
I am using uuid4(), but it generates a 32-digit UID; would it be okay to slice it [:21]? I don't want the id to repeat in the future.
Any suggestions would be appreciated!
I'm definitely no expert in Python or pandas, but I puzzled the following together; you might find something useful.
First I tried to use NumPy, but I hit the int64 upper limit:
import pandas as pd
import numpy as np
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'ID':[0,0,0,0]}
df = pd.DataFrame(data)
df.ID = np.random.randint(0, 9223372036854775807, len(df.index), np.int64)
df.ID = df.ID.map('{:020d}'.format)
print(df)
Results:
Name ID
0 Tom 03486834039218164118
1 Jack 04374010880686283851
2 Steve 05353371839474377629
3 Ricky 01988404799025990141
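A side note on that upper limit (my addition, not part of the original attempt): int64 tops out at 9223372036854775807, which is 19 digits, so a single draw can never cover the full 20-digit space; that is why every ID above starts with 0. A minimal sketch of one workaround, concatenating two 10-digit draws:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky']})

# two independent 10-digit halves together span the full 20-digit range
high = np.random.randint(0, 10**10, len(df.index), dtype=np.int64)
low = np.random.randint(0, 10**10, len(df.index), dtype=np.int64)
df['ID'] = [f'{h:010d}{l:010d}' for h, l in zip(high, low)]
print(df)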
So then I tried a custom function and applied that:
import pandas as pd
import random
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'ID':[0,0,0,0]}
df = pd.DataFrame(data)
def UniqueID():
    UID = '{:020d}'.format(random.randint(0, 99999999999999999999))
    # retry while the candidate UID is already taken; the original tested
    # "UniqueID" (the function itself), which never matches anything
    while UID in df.ID.unique():
        UID = '{:020d}'.format(random.randint(0, 99999999999999999999))
    return UID

df.ID = df.apply(lambda row: UniqueID(), axis=1)
print(df)
Returns:
Name ID
0 Tom 46160813285603309146
1 Jack 88701982214887715400
2 Steve 50846419997696757412
3 Ricky 00786618836449823720
I think uuid4() in Python works; just slice it accordingly.
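If you go this route, note that uuid4's hex form mixes letters and digits, so slicing the string does not give 20 decimal digits; one reading of "slice it accordingly" is to slice the UUID's integer form instead, though truncating does weaken uuid4's collision guarantees. A sketch:
import uuid

# take the first 20 decimal digits of the 128-bit random integer,
# zero-padding in the (astronomically rare) case it comes out shorter
uid = str(uuid.uuid4().int)[:20].zfill(20)
print(uid)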
I have data that I want to retrieve from a couple of text files in a folder. For each file in the folder, I create a pandas.DataFrame to store the data. For now it works correctly, and all the files have the same number of rows.
Now what I want to do is add each of these dataframes to a 'master' dataframe containing all of them. I would like to add each dataframe to the master dataframe under its file name.
I already have the file name.
For example, let's say I have 2 dataframes with their own file names; I want to add them to the master dataframe with a header for each of these 2 dataframes representing the name of the file.
What I have tried now is the following:
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame()
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_name = getFileName(file)
    t0_data.insert(loc=len(t0_data.columns), column=file_name, value=file_data)
Could someone help me with this please?
Thank you :)
Edit:
I think I was not clear enough, this is what I am expecting as an output:
output
You may be looking for the concat function. Here's an example:
import pandas as pd
A = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6]})
B = pd.DataFrame({'Col1': [7, 8, 9], 'Col2': [10, 11, 12]})
a_filename = 'a_filename.txt'
b_filename = 'b_filename.txt'
A['filename'] = a_filename
B['filename'] = b_filename
C = pd.concat((A, B), ignore_index = True)
print(C)
Output:
Col1 Col2 filename
0 1 4 a_filename.txt
1 2 5 a_filename.txt
2 3 6 a_filename.txt
3 7 10 b_filename.txt
4 8 11 b_filename.txt
5 9 12 b_filename.txt
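If you instead want each file's name as a header above its own block of columns (closer to the expected output in the question's edit), a column-wise concat with keys builds a MultiIndex; a sketch using the same example frames:
import pandas as pd

A = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6]})
B = pd.DataFrame({'Col1': [7, 8, 9], 'Col2': [10, 11, 12]})

# the dict keys become the top level of the column index
side_by_side = pd.concat({'a_filename.txt': A, 'b_filename.txt': B}, axis=1)
print(side_by_side)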
There are a couple of changes to make here in order to make this happen in an easy way. I'll list the changes and the reasoning below:
1. Specify which columns your master DataFrame will have.
2. Instead of the function you seemed to be defining to extract the name, simply create a new column called "file_name" holding the filepath used to build each DataFrame. That way, when you combine the DataFrames, each record's origin is clear. I commented the line where you can make edits if you want to use string methods to clean up the filenames.
3. At the end, don't use insert. For combining DataFrames with the same columns (a union operation, if you're familiar with SQL or set theory), use pd.concat; the older DataFrame.append did the same job but was removed in pandas 2.0.
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame(columns=['wavelength', 'max', 'min', 'file_name'])
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    t0_data = pd.concat([t0_data, file_data], ignore_index=True)
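A small follow-up on that loop: concatenating inside the loop copies the growing frame on every pass, so with many files it is usually faster to collect the per-file frames in a list and concatenate once at the end. A sketch, assuming the asker's parseGFfile helper:
frames = []
for file in t0_folder:
    file_data = pd.DataFrame(parseGFfile(file), columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file
    frames.append(file_data)

t0_data = pd.concat(frames, ignore_index=True)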
I have a dataframe containing an id column, a linked id column, and a value column. The linked id is "optional" and refers to a different row in the same dataframe (with -1 denoting no link). What I want to do is select rows that have a valid link where value is equal to value in the row given by the linked id:
import pandas as pd
df = pd.DataFrame({"id": [0,1,2,3,4,5], "linkid": [-1,3,-1,0,5,-1], "value": [10, 20, 30, 20, 40, 50]})
print(df)
# should match row 1 (only): id 1 has value 20 and linkid 3 also has value 20
# no other row should match (e.g. id 3 links to id 0, but their values differ)
matched = df.loc[df.value == df.loc[df.id == df.linkid].value]
# ValueError: Can only compare identically-labeled Series objects
My attempt above results in an error. I suspect my attempt is pretty far from the mark, but I am not sure how to proceed. I want to avoid loops for performance reasons. Any help gratefully received.
I thought it was clear enough, but as per the comment in the code, my required output in this example is row 1 from the original dataframe:
id linkid value
1 3 20.0
I think you can try this:
new_df = df.merge(df[['id','value']].rename(columns={'id':'linkid'}),how='left',on="linkid")
new_df[new_df.value_x == new_df.value_y]
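Running that on the question's example data shows the single expected match (value_x is the row's own value, value_y the linked row's):
import pandas as pd

df = pd.DataFrame({"id": [0, 1, 2, 3, 4, 5],
                   "linkid": [-1, 3, -1, 0, 5, -1],
                   "value": [10, 20, 30, 20, 40, 50]})

new_df = df.merge(df[['id', 'value']].rename(columns={'id': 'linkid'}),
                  how='left', on='linkid')
print(new_df[new_df.value_x == new_df.value_y])
#    id  linkid  value_x  value_y
# 1   1       3       20     20.0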
Create another column, value_link, holding for each row the value of the row whose id equals its linkid, as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({"id": [0,1,2,3,4,5], "linkid": [-1,3,-1,0,5,-1], "value": [10, 20, 30, 20, 40, 50]})
df['value_link'] = df.linkid.apply(lambda x: df[df['id'] == x].value.values[0] if x != -1 else np.nan)
matched = df[df.value == df.value_link]
The Python Dataset module is based on SQLAlchemy and exposes a function, all(), that returns all records in a table as an iterable object.
users = db['user'].all()
for user in db['user']:
    print(user['age'])
What is the simplest way to convert a Dataset object to a Pandas DataFrame object?
For clarity, I am interested in utilizing Dataset's functionality as it has already loaded the table into a Dataset object.
This worked for me:
import dataset
import pandas
db = dataset.connect('sqlite:///db.sqlite3')
data = list(db['my_table'].all())
dataframe = pandas.DataFrame(data=data)
import pandas as pd
df = pd.DataFrame(data=db['user'])
df
Similarly,
pd.DataFrame(db['user'])
should do the same thing
You can also specify the columns or index:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
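For instance, a minimal sketch of specifying columns and an index, assuming the user table has id and age columns as in the question (materializing the rows first keeps it version-agnostic):
import pandas as pd

rows = list(db['user'])  # dataset yields one dict-like record per row
df = pd.DataFrame(rows, columns=['id', 'age']).set_index('id')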
After some significant time invested in the Dataset module, I found that all() can be iterated into a list and then turned into a pandas dataframe. Is there a better way of doing this?
import dataset
import pandas as pd
# create dataframe
df = pd.DataFrame()
names = ['Bob', 'Jane', 'Alice', 'Ricky']
ages = [31, 30, 31, 30]
df['names'] = names
df['ages'] = ages
print(df)
# create a dict oriented as records from dataframe
user = df.to_dict(orient='records')
# using dataset module instantiate database
db = dataset.connect('sqlite:///mydatabase.db')
# create a reference to a table
table = db['user']
# insert the complete dict into database
table.insert_many(user)
# use Dataset .all() to retrieve all table's rows
from_sql = table.all() # custom ResultIter type (iterable)
# iterate ResultIter type into a list
data = []
for row in from_sql:
    data.append(row)

# create dataframe from list and ordereddict keys
df_new = pd.DataFrame(data, columns=from_sql.keys)

# drop returns a new frame unless reassigned (or called with inplace=True);
# that is why a bare df_new.drop(columns=['id']) did not remove the column
df_new = df_new.drop(columns=['id'])
print(df_new)
'''
names ages
0 Bob 31
1 Jane 30
2 Alice 31
3 Ricky 30
names ages
0 Bob 31
1 Jane 30
2 Alice 31
3 Ricky 30
'''
I've created some helper functions that should make this process even simpler:
import dataset
import pandas as pd
def df_dataset_save(df, table_name, db_name='db'):
    try:
        df = df.to_dict(orient='records')
        db = dataset.connect('sqlite:///' + db_name + '.sqlite')
        table = db[table_name]
        table.insert_many(df)
        return 'success'
    except Exception as e:
        print(e)
        return None

def df_dataset_query_all(table_name, db_name='db', ids=False):
    try:
        db = dataset.connect('sqlite:///' + db_name + '.sqlite')
        table = db[table_name]
        from_sql = table.all()
        data = []
        for row in from_sql:
            data.append(row)
        df = pd.DataFrame(data, columns=from_sql.keys)
        if not ids:
            df.drop('id', axis=1, inplace=True)
        return df
    except Exception as e:
        print(e)
        return None
# create dataframe
users = pd.DataFrame()
names = ['Bob', 'Jane', 'Alice', 'Ricky']
ages = [31, 30, 31, 30]
users['names'] = names
users['ages'] = ages
# save dataframe
df_dataset_save(users, 'users')
# query saved dataframe
new_user = df_dataset_query_all('users')
print(new_user)
'''
names ages
0 Bob 31
1 Jane 30
2 Alice 31
3 Ricky 30
'''