How can I generate a random 20-digit UID (unique ID) in Python? I want to generate a UID for each row in my data frame. It should be exactly 20 digits and should be unique.
I am using uuid4(), but it generates a 32-character UID. Would it be okay to slice it with [:20]? I don't want the ID to repeat in the future.
Any suggestions would be appreciated!
I'm definitely no expert in Python or pandas, but I pieced the following together. You might find something useful:
First I tried NumPy, but np.int64 tops out at 9223372036854775807 (19 digits), so the zero-padded 20-digit result always starts with 0:
import pandas as pd
import numpy as np
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'ID':[0,0,0,0]}
df = pd.DataFrame(data)
df.ID = np.random.randint(0, 9223372036854775807, len(df.index), np.int64)
df.ID = df.ID.map('{:020d}'.format)
print(df)
Results:
Name ID
0 Tom 03486834039218164118
1 Jack 04374010880686283851
2 Steve 05353371839474377629
3 Ricky 01988404799025990141
So then I tried a custom function and applied that:
import pandas as pd
import random
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'ID':[0,0,0,0]}
df = pd.DataFrame(data)
def UniqueID():
    # draw a zero-padded 20-digit string and redraw while it already exists in df.ID
    UID = '{:020d}'.format(random.randint(0,99999999999999999999))
    while UID in df.ID.unique():
        UID = '{:020d}'.format(random.randint(0,99999999999999999999))
    return UID
df.ID = df.apply(lambda row: UniqueID(), axis = 1)
print(df)
Returns:
Name ID
0 Tom 46160813285603309146
1 Jack 88701982214887715400
2 Steve 50846419997696757412
3 Ricky 00786618836449823720
I think uuid4() in python works, just slice it accordingly
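A caveat on slicing: uuid4().hex is a hexadecimal string, so a slice can contain the letters a-f and is not guaranteed to be all digits. A minimal sketch of a digits-only alternative follows, assuming the standard-library secrets module is acceptable; the helper name make_unique_ids is hypothetical and not part of any answer above. It only guarantees uniqueness within one batch, so IDs that must never repeat in the future would still need to be checked against previously issued ones.

```python
import secrets

import pandas as pd


def make_unique_ids(n, digits=20):
    """Return n distinct zero-padded decimal strings, each `digits` long.

    Hypothetical helper: uniqueness holds only within this batch.
    """
    seen = set()
    while len(seen) < n:
        # randbelow(10**digits) draws from 0 .. 10**digits - 1, then zero-pads
        seen.add(f"{secrets.randbelow(10 ** digits):0{digits}d}")
    return list(seen)


df = pd.DataFrame({"Name": ["Tom", "Jack", "Steve", "Ricky"]})
df["ID"] = make_unique_ids(len(df))
print(df)
```

Using a set makes the duplicate check cheap, so a collision simply triggers another draw.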
I'm using pandas 1.1.3, the latest available with Anaconda.
I have two DataFrames, imported from a .txt and a .xlsx file. They have a column called "ID" which is an int64 (verified with df.info()) on both DataFrames.
df1:
ID Name
0 1234564567 Last, First
1 1234564569 Last, First
...
df2:
ID Amount
0 1234564567 59.99
1 5678995545 19.99
I want to check if all of the IDs on df1 are on df2. For this I create a series:
foo = df1["ID"].isin(df2["ID"])
And I get that all values are False, even though manually I checked and the values do match.
0 False
1 False
2 False
3 False
4 False
...
I'm not sure if I'm missing something, if there is something wrong with the environment, or if it is a known bug.
You must be doing something wrong. Try to reproduce the error with a toy example, as I did here. The code below works for me.
Reproducing the problem with a minimal example, and sharing it, not only lets you test your own assumptions but also lets us help.
import pandas as pd
import numpy as np
data = {'Name':['Tom', 'nick'], 'ID':[1234564567, 1234564569]}
data2 = {'Name':['Tom', 'nick'], 'ID':[1234564567, 5678995545]}
# Create DataFrame
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
df["ID"].isin(df2["ID"])
0 True
1 False
Name: ID, dtype: bool
EDIT: with Paul's data I don't get any error. See the importance of providing examples?
import pandas as pd
data = {'ID':['1234564567', '1234564569'],'Name':['Last, First', 'Last, First']}
data2 = {'ID':['1234564567', '5678995545'],'Amount': [59.99, 19.99]}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
df1["ID"].isin(df2["ID"])
0 True
1 False
import pandas as pd
data = {'ID':['1234564567', '1234564569'],'Name':['Last, First', 'Last, First']}
data2 = {'ID':['1234564567', '5678995545'],'Amount': [59.99, 19.99]}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
Now that we have that set up, we get to the meat:
df1["ID"].apply(lambda x: df2['ID'].isin([x]))
Which shows
0 1
0 True False
1 False False
That is, the ID in row 0 of df1 matches the ID in row 0 of df2, and nothing else matches.
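One common cause of isin returning all False is a dtype mismatch between the two columns, for example integers on one side and strings on the other. The asker reports int64 on both sides, so this may not be the cause here; the sketch below only shows what the symptom looks like and how to check for it, using the sample IDs from the question.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1234564567, 1234564569]})      # integers
df2 = pd.DataFrame({"ID": ["1234564567", "5678995545"]})  # strings

# Compare the dtypes first; they differ here
print(df1["ID"].dtype, df2["ID"].dtype)

# An integer never equals a string, so nothing matches
mismatched = df1["ID"].isin(df2["ID"])
print(mismatched)

# After casting to a common dtype, the first ID matches
df2["ID"] = df2["ID"].astype("int64")
matched = df1["ID"].isin(df2["ID"])
print(matched)
```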
I'm new to doing parallel processing in Python. I have a large dataframe with names and the list of countries that the person lived in. A sample dataframe is this:
I have a chunk of code that takes in this dataframe and splits the countries to separate columns. The code is this:
def split_country(data):
    d_list = []
    for index, row in data.iterrows():
        for value in str(row['Country']).split(','):
            d_list.append({'Name':row['Name'],
                           'value':value})
    data = data.append(d_list, ignore_index=True)
    data = data.groupby('Name')['value'].value_counts()
    data = data.unstack(level=-1).fillna(0)
    return (data)
The final output is something like this:
I'm trying to parallelize the above process by passing my dataframe (df) using the following:
import multiprocessing as mp
result = []
pool = mp.Pool(mp.cpu_count())
result.append(pool.map(split_country, [row for row in df]))
But the processing does not stop even with a toy dataset like the above. I'm completely new to this, so would appreciate any help
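One detail in the snippet above is worth checking, offered as an observation and not a full diagnosis: iterating directly over a DataFrame yields its column labels, so [row for row in df] hands pool.map a list of strings, not rows or sub-frames. The sketch below uses a two-row frame with the Name and Country columns from the question; the chunking with iloc is just one assumed way to build DataFrame pieces that a worker function could accept.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Jack"], "Country": ["USA,UK", "China,UK"]})

# Iterating over a DataFrame gives column labels, not rows
items = [row for row in df]
print(items)

# Slicing with iloc yields DataFrame chunks (chunk size of 1 is arbitrary here)
chunk_size = 1
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
print(len(chunks))
```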
multiprocessing is probably not required here. Using pandas vectorized methods will be sufficient to quickly produce the desired result.
For a test DataFrame with 1M rows, the following code took 1.54 seconds.
First, use pandas.DataFrame.explode on the column of lists
If the column is strings, first use ast.literal_eval to convert them to list type
df.countries = df.countries.apply(ast.literal_eval)
If the data is read from a CSV file, use df = pd.read_csv('test.csv', converters={'countries': literal_eval})
For this question, it's better to use pandas.get_dummies to get a count of each country per name, then pandas.DataFrame.groupby on 'name', and aggregate with .sum
import pandas as pd
from ast import literal_eval
# sample data
data = {'name': ['John', 'Jack', 'James'], 'countries': [['USA', 'UK'], ['China', 'UK'], ['Canada', 'USA']]}
# create the dataframe
df = pd.DataFrame(data)
# if the countries column is strings, uncomment the next line to evaluate them to lists;
# the sample data above already holds lists, so the line is skipped here
# df.countries = df.countries.apply(literal_eval)
# explode the lists
df = df.explode('countries')
# use get_dummies and groupby name and sum
df_counts = pd.get_dummies(df, columns=['countries'], prefix_sep='', prefix='').groupby('name', as_index=False).sum()
# display(df_counts)
name Canada China UK USA
0 Jack 0 1 1 0
1 James 1 0 0 1
2 John 0 0 1 1
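If the countries are stored as plain comma-separated strings, as the split(',') in the question suggests, a similar result can be sketched without literal_eval. This assumes the column names Name and Country from the question's code and swaps in pandas.crosstab for get_dummies plus groupby; the sample values are the ones from the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Jack", "James"],
    "Country": ["USA,UK", "China,UK", "Canada,USA"],
})

# Split each string into a list, then give every country its own row
long = df.assign(Country=df["Country"].str.split(",")).explode("Country")
# Remove stray whitespace in case the source uses ", " as the separator
long["Country"] = long["Country"].str.strip()

# Count occurrences of each country per name
counts = pd.crosstab(long["Name"], long["Country"])
print(counts)
```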
I have a pandas dataframe which looks like this:
Name Age
0 tom 10
1 nick 15
2 juli 14
I am trying to iterate over each name --> connect to a mysql database --> match the name with a column in the database --> fetch the id for the name --> and replace the id in the place of name
in the above data frame. The desired output is as follows:
Name Age
0 1 10
1 2 15
2 4 14
The following is the code that I have tried:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def@localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)
for index, rows in df.iterrows():
    cquery="select id from students where studentsName="+'"' + rows['Name'] + '"'
    sid = pd.read_sql(cquery, con=engine)
    df['Name'] = sid['id'].iloc[0]
    print(df[['Name','Age']])
The above code prints the following output:
Name Age
0 1 10
1 1 15
2 1 14
Name Age
0 2 10
1 2 15
2 2 14
Name Age
0 4 10
1 4 15
2 4 14
I understand it iterates through the entire table for each matched name and prints it. How do I get each value replaced only once?
Here is a slight rewrite of your code. If you want to transform a dataframe column in general, applying a function to it is a better way to go about it:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def@localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
def replace_name(name: str) -> int:
    # look up the id for a single student name
    cquery="select id from students where studentsName='{}'".format(name)
    sid = pd.read_sql(cquery, con=engine)
    return sid['id'].iloc[0]
# apply passes each name (a plain string) to replace_name
df['Name'] = df['Name'].apply(replace_name)
This should perform the transformation you're looking for.
The problem in your code as written is the line:
df['Name'] = sid['id'].iloc[0]
This sets every value in the Name column to the first id entry in your query result.
To accomplish what you want, you want something like:
df.loc[index, 'Name'] = sid['id'].iloc[0]
This will set the value at row label index in column Name to the first id entry in your query result.
This will accomplish what you want to do, and you can stop reading here if you're in a hurry. If you're not in a hurry, and you'd like to become wiser, I encourage you to read on.
It is generally a mistake to loop over the rows in a dataframe. It's also generally a mistake to iterate through a list carrying out a single query on each item in the list. Both of these are slow and error-prone.
A more idiomatic (and faster) way of doing this would be to get all the relevant rows from the database in one query, merge them with your current dataframe, and then drop the column you no longer want. Something like the following:
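names = df['Name'].tolist()
# quote each name so the IN clause is valid SQL
name_list = ','.join(f"'{name}'" for name in names)
query = f"select id, studentsName as Name from students where studentsName in ({name_list})"
student_ids = pd.read_sql(query, con=engine)
df_2 = df.merge(student_ids, on='Name', how='left')
df_with_ids = df_2[['id', 'Age']]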
One query executed, no loops to worry about. Let the database engine and Pandas do the work for you.
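The single-query approach can be tried without a MySQL server. The sketch below is an illustration under stated assumptions: it swaps in an in-memory SQLite database, invents a small students table (including an extra row for a hypothetical student 'sam') so that the ids match the desired output in the question, and uses ? placeholders with params so the names are not pasted into the SQL string.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the MySQL database (assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("create table students (id integer primary key, studentsName text)")
conn.executemany(
    "insert into students (id, studentsName) values (?, ?)",
    [(1, "tom"), (2, "nick"), (3, "sam"), (4, "juli")],
)
conn.commit()

df = pd.DataFrame([["tom", 10], ["nick", 15], ["juli", 14]], columns=["Name", "Age"])

# One placeholder per name; the values travel separately via params
names = df["Name"].tolist()
placeholders = ",".join("?" for _ in names)
query = (
    "select id, studentsName as Name from students "
    f"where studentsName in ({placeholders})"
)
student_ids = pd.read_sql(query, con=conn, params=names)

# A left merge keeps the row order of df; then keep only the wanted columns
df_with_ids = df.merge(student_ids, on="Name", how="left")[["id", "Age"]]
print(df_with_ids)
```

With a different database driver the placeholder style may differ, so the ? syntax here should be treated as specific to SQLite.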
You can do this kind of operation in the following way. Please follow the comments and feel free to ask questions:
import pandas as pd
# create frame
x = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "age": [1, 2, 3]
    }
)
# create some kind of db
mock_database = {"A": 10, "B": 20, "C": 30}
x["id"] = None # add empty column
print(x)
# change values in the new column
for i in range(len(x["name"])):
    # use .loc to avoid chained assignment, which may not write back to x
    x.loc[i, "id"] = mock_database.get(x["name"][i])
print("*" * 100)
print(x)
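The same lookup can also be written without an explicit loop. This is a minimal sketch using Series.map with the same mock dictionary; names missing from the dictionary would come back as NaN.

```python
import pandas as pd

x = pd.DataFrame({"name": ["A", "B", "C"], "age": [1, 2, 3]})
mock_database = {"A": 10, "B": 20, "C": 30}

# map looks up every name in the dictionary in one vectorized step
x["id"] = x["name"].map(mock_database)
print(x)
```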
A good way to do that would be:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def@localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)
name_ids = []
for student_name in df['Name']:
    cquery="select id from students where studentsName='{}'".format(student_name)
    sid = pd.read_sql(cquery, con=engine)
    # take the first matching id, or None when the query returns no rows
    name_ids.append(sid['id'].iloc[0] if not sid.empty else None)
# debugged with name_ids = [1,2,3]
df['Name'] = name_ids
print(df)
I checked this with an example list of ids and it works, so if the query format is correct this should work.
Performance-wise I could not think of a better solution, since this runs one query per student, but there is probably a way to get all the ids with fewer queries.
import requests
import pandas as pd
import io
# read the URL and create the dataframe
urlData=requests.get("http://demo.rahierp.com/desk#List/Employee/List").content
df = pd.read_csv(io.StringIO(urlData.decode('utf-8')))
# print the dataframe
df
# apply groupby() to group the data on the 'Reports To' column
gk = df.groupby('Reports To')
gk
# each item is a (group key, sub-dataframe) pair
for reports_to, reports_to_df in gk:
    print(reports_to)
    print(reports_to_df)
Put in the column name (i.e. Reports_to) and the column value on which you want to filter the result:
df.loc[lambda df: df.Reports_to == 'John']
output
name manager
0 Orid John
1 David John
Using boolean indexing:
print(df.loc[df['manager']=='LMN'])
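To try the grouping and filtering without fetching anything over the network, here is a small self-contained sketch. The frame is made up for illustration: Orid and David reporting to John come from the output shown above, while the third row (Asha reporting to LMN) is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Orid", "David", "Asha"],
    "manager": ["John", "John", "LMN"],
})

# groupby yields (group key, sub-dataframe) pairs
for manager, group in df.groupby("manager"):
    print(manager)
    print(group)

# a boolean mask keeps only the rows whose manager is John
johns_reports = df.loc[df["manager"] == "John"]
print(johns_reports)
```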
The Python Dataset module is built on SQLAlchemy and exposes a method called all() that returns all records in a table. all() returns an iterable Dataset object.
users = db['user'].all()
for user in db['user']:
    print(user['age'])
What is the simplest way to convert a Dataset object to a Pandas DataFrame object?
For clarity, I am interested in utilizing Dataset's functionality as it has already loaded the table into a Dataset object.
this worked for me:
import dataset
import pandas
db = dataset.connect('sqlite:///db.sqlite3')
data = list(db['my_table'].all())
dataframe = pandas.DataFrame(data=data)
import pandas as pd
df = pd.DataFrame(data=db['user'])
df
similarly
pd.DataFrame(db['user'])
should do the same thing
You can also specify the columns or index:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
After some significant time invested into Dataset module, I found that the all() could be iterated into a list and then turned into a pandas dataframe. Is there a better way of doing this?
import dataset
import pandas as pd
# create dataframe
df = pd.DataFrame()
names = ['Bob', 'Jane', 'Alice', 'Ricky']
ages = [31, 30, 31, 30]
df['names'] = names
df['ages'] = ages
print(df)
# create a dict oriented as records from dataframe
user = df.to_dict(orient='records')
# using dataset module instantiate database
db = dataset.connect('sqlite:///mydatabase.db')
# create a reference to a table
table = db['user']
# insert the complete dict into database
table.insert_many(user)
# use Dataset .all() to retrieve all table's rows
from_sql = table.all() # custom ResultIter type (iterable)
# iterate ResultIter type into a list
data = []
for row in from_sql:
    data.append(row)
# create dataframe from list and ordereddict keys
df_new = pd.DataFrame(data, columns=from_sql.keys)
# this does not drop the id column: drop() returns a new dataframe,
# so the result must be assigned back (or inplace=True used) to take effect
df_new.drop(columns=['id'])
print(df_new)
'''
names ages
0 Bob 31
1 Jane 30
2 Alice 31
3 Ricky 30
id names ages
0 1 Bob 31
1 2 Jane 30
2 3 Alice 31
3 4 Ricky 30
'''
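The id column survives in the output above because DataFrame.drop returns a new dataframe and leaves the original untouched unless the result is assigned back. A minimal pandas-only sketch of that behaviour, with a hand-built frame standing in for the rows read from the database:

```python
import pandas as pd

df_new = pd.DataFrame({"id": [1, 2], "names": ["Bob", "Jane"], "ages": [31, 30]})

# The result is discarded, so df_new still has the id column
df_new.drop(columns=["id"])
still_has_id = "id" in df_new.columns
print(still_has_id)

# Assigning the result back removes the column
df_new = df_new.drop(columns=["id"])
print(df_new)
```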
I've created some helper functions that should make this process even simpler:
import dataset
import pandas as pd
def df_dataset_save(df, table_name, db_name='db'):
    try:
        df = df.to_dict(orient='records')
        db = dataset.connect('sqlite:///' + db_name + '.sqlite')
        table = db[table_name]
        table.insert_many(df)
        return 'success'
    except Exception as e:
        print(e)
        return None
def df_dataset_query_all(table_name, db_name='db', ids=False):
    try:
        db = dataset.connect('sqlite:///' + db_name + '.sqlite')
        table = db[table_name]
        from_sql = table.all()
        data = []
        for row in from_sql:
            data.append(row)
        df = pd.DataFrame(data, columns=from_sql.keys)
        if not ids:
            df.drop('id', axis=1, inplace=True)
        return df
    except Exception as e:
        print(e)
        return None
# create dataframe
users = pd.DataFrame()
names = ['Bob', 'Jane', 'Alice', 'Ricky']
ages = [31, 30, 31, 30]
users['names'] = names
users['ages'] = ages
# save dataframe
df_dataset_save(users, 'users')
# query saved dataframe
new_user = df_dataset_query_all('users')
print(new_user)
'''
names ages
0 Bob 31
1 Jane 30
2 Alice 31
3 Ricky 30
'''