Python contains in pandas DataFrame - python

In each iteration I group rows with the same price, add their quantities together, and concatenate the exchange names, like this:
asks_price asks_qty exchange_name_ask bids_price bids_qty exchange_name_bid
0 20156.51 0.000745 Coinbase 20153.28 0.000200 Coinbase
1 20157.52 0.050000 Coinbase 20152.27 0.051000 Coinbase
2 20158.52 0.050745 CoinbaseFTX 20151.28 0.051200 KrakenCoinbase
But to build the order book, I have to drop one provider's rows each time before updating it, so I do:
self.global_orderbook = self.global_orderbook[
self.global_orderbook.exchange_name_ask != name]
After removing Coinbase, for example, I am left with:
asks_price asks_qty exchange_name_ask bids_price bids_qty exchange_name_bid
0 20158.52 0.050745 CoinbaseFTX 20151.28 0.051200 KrakenCoinbase
But I want combined entries such as KrakenCoinbase to be removed as well,
so I want to do something like:
self.global_orderbook = self.global_orderbook[name not in self.global_orderbook.exchange_name_ask]
It doesn't work.
I already tried contains, but I can't call it on a Series:
self.global_orderbook = self.global_orderbook[self.global_orderbook.exchange_name_ask.contains(name)]
It fails with:
'Series' object has no attribute 'contains'
Thanks for your help.

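A Series has no contains method of its own; the string methods live under the .str accessor. Casting with astype(str) first makes sure every value in the column is a string:
self.global_orderbook = self.global_orderbook[self.global_orderbook.exchange_name_ask.astype(str).str.contains(name,regex=False)]
This keeps the rows whose exchange_name_ask contains name. To drop those rows instead, negate the mask by putting ~ in front of it.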
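As a minimal sketch of the "drop every row that mentions this exchange" case from the question, the mask can be inverted with ~. The small order book below is made up for illustration and only mirrors the column names in the question.

```python
import pandas as pd

# Hypothetical order book modelled on the rows shown in the question.
orderbook = pd.DataFrame({
    "asks_price": [20156.51, 20157.52, 20158.52],
    "exchange_name_ask": ["Coinbase", "FTX", "CoinbaseFTX"],
})
name = "Coinbase"

# True where the exchange label contains `name` as a literal substring.
mask = orderbook["exchange_name_ask"].astype(str).str.contains(name, regex=False)

# `~` inverts the mask, so only rows that do NOT mention `name` are kept.
remaining = orderbook[~mask]
print(remaining)
```

With regex=False the name is matched literally, so characters that are special in regular expressions do not need escaping.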

Related

Auto Increment column value is larger than I expected

When I insert data into a DB with Python,
the auto-increment column value ends up larger than I expected.
Assume that I call the following function multiple times to insert data into the DB.
'db_engine' is a DB that contains the tables 'tbl_student' and 'tbl_score'.
To track the total number of students, 'tbl_student' has an auto-increment column named 'index'.
def save_in_db(db_engine, dataframe):
    # tbl_student
    student_dataframe = pd.DataFrame({
        "ID": dataframe['ID'],
        "NAME": dataframe['NAME'],
        "GRADE": dataframe['GRADE'],
    })
    student_dataframe.to_sql(name="tbl_student", con=db_engine, if_exists='append', index=False)
    # tbl_score
    score_dataframe = pd.DataFrame({
        "SCORE_MATH": dataframe['SCORE_MATH'],
        "SCORE_SCIENCE": dataframe['SCORE_SCIENCE'],
        "SCORE_HISTORY": dataframe['SCORE_HISTORY'],
    })
    score_dataframe.to_sql(name="tbl_score", con=db_engine, if_exists='append', index=False)
'tbl_student' after some inputs is as follows:
index ID      NAME   GRADE
0     2023001 Amy    1
1     2023002 Brady  1
2     2023003 Caley  4
6     2023004 Dee    2
7     2023005 Emma   2
8     2023006 Favian 3
12    2023007 Grace  3
13    2023008 Harry  3
14    2023009 Ian    3
Please take a look at the 'index' column.
After several inserts, 'index' has larger values than I expected (it jumps from 2 to 6 and from 8 to 12).
What should I try to solve this problem?
You could try:
student_dataframe.reset_index()
Actually, the real problem is that 'index' is referenced by another table as a FOREIGN KEY.
Every time I added data, an error occurred because the referenced key did not exist (the index values are not continuous!).
I solved this by checking the index value once before putting the data in the DB and setting it as the key.
The following code is what I tried.
index_no = get_index(db_engine)
dataframe.index = dataframe.index + index_no + 1 - len(dataframe)
dataframe.reset_index(inplace=True)
If anyone has the same problem, it may be better to try an approach like this rather than trying to make the auto-increment key sequential.
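The index arithmetic above can be checked in isolation. This is only a sketch under stated assumptions: get_index is the poster's own helper and is not shown, so here it is replaced by a hard-coded index_no that is assumed to be the largest auto-increment value in tbl_student after the new student rows were inserted, and the new rows are assumed to have received consecutive values. The score values are made up.

```python
import pandas as pd

# Hypothetical stand-in for get_index(db_engine): assumed to be the largest
# auto-increment value in tbl_student after the new students were inserted.
index_no = 14

# Three new score rows carrying pandas' default index 0, 1, 2 (made-up data).
scores = pd.DataFrame({"SCORE_MATH": [90, 75, 60]})

# Shift the default index so that it ends at index_no: 12, 13, 14.
scores.index = scores.index + index_no + 1 - len(scores)

# Turn the index into a regular column named "index" to use as the key.
scores = scores.reset_index()
print(scores)
```

If rows can be inserted concurrently by other writers, the consecutive-values assumption may not hold, and reading the generated keys back from the database would be safer.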

Convert multiple columns to a list inside a single column in pandas

I am using Azure Databricks and reading different Excel forms stored in a blob. I need to keep 3 columns as they are and group the remaining response columns, which are numerous and differ from form to form, into a list.
My main goal here is to turn those different columns into a single column holding an object whose keys are the question titles and whose values are the responses.
I have the following dataframe:
id name email          question_1  question_2  question_3
1  mark mark#email.com response_11 response_21 response_31
3  elon elon#email.com response_12 response_22 response_32
I would like to have the following output.
id name email          responses
1  mark mark#email.com {'question_1':'response_11','question_2':'response_21','question_3':'response_31'}
2  elon elon#email.com {'question_1':'response_12','question_2':'response_22','question_3':'response_32'}
3  zion zion#email.com {'question_1':'response_13','question_2':'response_23','question_3':'response_33'}
How could I get that using pandas? I already did the following:
baseCols = ['id','name','email']
def getFormsColumnsName(df):
    df_response_columns = df.columns.values.tolist()
    # remove the base columns so that only the form's question columns remain
    for deleted_column in baseCols:
        df_response_columns.remove(deleted_column)
    return df_response_columns
formColumns = getFormsColumnsName(df)
df = df.astype(str)
df['responses'] = df[formColumns].values.tolist()
display(df)
But this gives me this strange list of responses:
id name email      responses
1  mark mark#email 0: "response11"1: "response12"2: "response13"3: "['response11', 'response12', 'response13' "[]"]"
I don't know what I should do to get what I expected.
Thank you in advance!
You can get your responses column by using pd.DataFrame.to_dict("records").
questions = df.filter(like="question")
responses = questions.to_dict("records")
out = df.drop(columns=questions.columns).assign(responses=responses)
output:
id name email responses
0 1 mark mark#email.com {'question_1': 'response_11', 'question_2': 'response_21', 'question_3': 'response_31'}
1 3 elon elon#email.com {'question_1': 'response_12', 'question_2': 'response_22', 'question_3': 'response_32'}
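Since the question says the response columns differ from form to form, a variant that selects "everything except the base columns" may be more robust than filtering on the word "question". The sketch below assumes the three base column names from the question; the sample rows are copied from the question's table, trimmed to two question columns.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 3],
    "name": ["mark", "elon"],
    "email": ["mark#email.com", "elon#email.com"],
    "question_1": ["response_11", "response_12"],
    "question_2": ["response_21", "response_22"],
})

base_cols = ["id", "name", "email"]
# Every column that is not a base column is treated as a form question.
question_cols = [c for c in df.columns if c not in base_cols]

# One dict per row, mapping question title to response.
responses = df[question_cols].to_dict("records")
out = df[base_cols].assign(responses=responses)
print(out)
```

Each cell of the responses column is then a plain Python dict, which is what the expected output in the question describes.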

How to count the unique values per date using python

I am practicing data analytics and I am stuck on one problem.
TRAINING DATAFRAME
I group the dataframe by Date Purchased and call unique() because I want to count the unique values for each date purchased.
training_groupby = training.groupby('DATE PURCHASED')['Account - Store Name'].unique().to_frame()
So it looks like this:
GROUPBY DATE PURCHASED
Now that the data has been aggregated, I want to count the items in that column, so I used .split(',').
training_groupby['Account - Store Name'].apply(lambda x: x.split(','))
But I got this error:
AttributeError: 'numpy.ndarray' object has no attribute 'split'
Can someone help me count the number of unique values per Date Purchased? I've been trying to solve this for almost a week now. I searched YouTube and Google, but I couldn't find anything that helps.
I think this is what you want?
training_groupby["Total Purchased"] = training_groupby["Account - Store Name"].apply(lambda x: len(set(x)))
You can do multiple aggregations in the same pandas.DataFrame.groupby clause.
Try this:
out = (training
       .groupby(['DATE PURCHASED'])
       .agg(**{
           'Account - Store Name': ('Account - Store Name', 'unique'),
           'Items Count': ('Account - Store Name', 'nunique'),
       })
      )
# Output:
print(out)
Account - Store Name Items Count
DATE PURCHASED
13/01/2022 [Landmark Makati, Landmark Nuvali] 2
14/01/2022 [Landmark Nuvali] 1
15/01/2022 [Robinsons Dolores, Landmark Nuvali] 2
16/01/2022 [Robinsons Ilocos Norte, Landmarj Trinoma] 2
19/01/2022 [Shopwise Alabang] 1
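If only the count is needed, the intermediate arrays can be skipped. unique() returns a NumPy array for each group, not a comma-separated string, which is why .split fails. A minimal sketch with made-up rows that reuse the column names from the question:

```python
import pandas as pd

# Made-up purchases; the column names follow the question.
training = pd.DataFrame({
    "DATE PURCHASED": ["13/01/2022", "13/01/2022", "13/01/2022", "14/01/2022"],
    "Account - Store Name": [
        "Landmark Makati", "Landmark Nuvali", "Landmark Makati", "Landmark Nuvali",
    ],
})

# nunique() counts the distinct store names within each date directly.
unique_counts = training.groupby("DATE PURCHASED")["Account - Store Name"].nunique()
print(unique_counts)
```

The result is a Series indexed by date, so unique_counts["13/01/2022"] gives the count for a single day.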

python multiple for loop question (pandas, dataframe)

idx_list = []
for idx, row in df_quries_copy.iterrows():
    for brand in brand_name:
        if row['user_query'].contains(brand):
            idx_list.append(idx)
        else:
            continue
The brand_name list looks like this:
brand_name = ['Apple', 'Lenovo', 'Samsung', ... ]
I have a df_queries data frame that holds the queries the users entered.
The table looks like this:
user_query   user_id
Apple Laptop A
Lenovo 5GB   B
I also have the brand names as a list.
I want to find the users whose queries mention a brand, such as 'Apple laptop'.
But when I run the script, I get a message saying
'str' object has no attribute 'contains'
How am I supposed to use multiple for loops here?
Thank you in advance.
for brand in brand_name[:100]:
    if len(copy_df[copy_df['user_query'].str.contains(brand)]) > 0:
        ls.append(copy_df[copy_df['user_query'].str.contains(brand)].index)
    else:
        continue
I tried it like the answer, but the whole dataframe suddenly came out as the result.
You can use df_quries_copy[df_quries_copy['user_query'].str.contains(brand)].index to get the index directly.
for brand in brand_name:
    df_quries_copy[df_quries_copy['user_query'].str.contains(brand)].index
Or, in your code, use brand in row['user_query'], since row['user_query'] is a plain string value.
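Both loops can also be replaced by a single vectorized match. The sketch below is one possible approach, not the answerer's code: it joins the brands into one regular expression and escapes each brand so that special characters are matched literally. The third row is made up to show a non-matching query.

```python
import re

import pandas as pd

df_queries = pd.DataFrame({
    "user_query": ["Apple Laptop", "Lenovo 5GB", "cheap phone"],
    "user_id": ["A", "B", "C"],
})
brand_name = ["Apple", "Lenovo", "Samsung"]

# Escape each brand so characters such as "+" or "." are matched literally,
# then join them into one alternation pattern: Apple|Lenovo|Samsung.
pattern = "|".join(re.escape(brand) for brand in brand_name)

# True for every query that mentions at least one brand.
mask = df_queries["user_query"].str.contains(pattern, regex=True)
idx_list = df_queries.index[mask].tolist()
print(idx_list)
```

Note that an empty string in brand_name would match every row, which is one possible reason for a filter like this returning the whole dataframe.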

Need to find data from one dataframe and see if it's in another in Pandas [Python]

I currently have 2 CSV files and am reading them both in. I need to take the IDs from one CSV and find them in the other so that I can get their rows of data. Currently I have the following code, which I believe goes through the first dataframe but only adds the last match to the new dataframe. I need it to add all of the matching rows, however.
Here is my code:
patientSet = pd.read_csv("794_chips_RMA.csv")
affSet = probeset[probeset['Analysis']==1].reset_index(drop=True)
houseGenes = probeset[probeset['Analysis']==0].reset_index(drop=True)
for x in affSet['Probeset']:
    #patients = patientSet[patientSet['ID']=='1557366_at'].reset_index(drop=True)
    #patients = patientSet[patientSet['ID']=='224851_at'].reset_index(drop=True)
    patients = patientSet[patientSet['ID']==x].reset_index(drop=True)
print(affSet['Probeset'])
print(patientSet['ID'])
print(patients)
The following is the output:
0 1557366_at
1 224851_at
2 1554784_at
3 231578_at
4 1566643_a_at
5 210747_at
6 231124_x_at
7 211737_x_at
Name: Probeset, dtype: object
0 1007_s_at
1 1053_at
2 117_at
3 121_at
4 1255_g_at
...
54670 AFFX-ThrX-5_at
54671 AFFX-ThrX-M_at
54672 AFFX-TrpnX-3_at
54673 AFFX-TrpnX-5_at
54674 AFFX-TrpnX-M_at
Name: ID, Length: 54675, dtype: object
ID phchp003v1 phchp003v2 phchp003v3 ... phchp367v1 phchp367v2 phchp368v1 phchp368v2
0 211737_x_at 12.223453 11.747159 9.941889 ... 14.828389 9.322779 10.609053 10.771162
As you can see, it only matches the very last ID from the first dataframe, not all of them. How can I get them all to match and end up in patients? Thank you.
You probably want to use the merge function:
df_inner = pd.merge(df1, df2, on='id', how='inner')
Check here: https://www.datacamp.com/community/tutorials/joining-dataframes-pandas and search for "inner join".
--edit--
You can specify the key columns separately (using left_on=None, right_on=None); look here: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
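@Rui Lima already posted the correct answer, but you'll need to use the following to make it work, because the key column is named 'ID' in patientSet and 'Probeset' in affSet:
df = pd.merge(patientSet, affSet, left_on='ID', right_on='Probeset', how='inner')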
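The loop in the question keeps only the last match because patients is overwritten on every iteration. As a small sketch of two ways to collect all matches at once, the frames below reuse a few IDs from the question's output; the numeric values are made up, and the snake_case variable names are chosen for this example only.

```python
import pandas as pd

patient_set = pd.DataFrame({
    "ID": ["1007_s_at", "224851_at", "211737_x_at"],
    "phchp003v1": [1.0, 2.0, 3.0],  # made-up expression values
})
aff_set = pd.DataFrame({"Probeset": ["224851_at", "211737_x_at", "231578_at"]})

# Option 1: boolean filter; keeps only patient_set's own columns.
patients = patient_set[patient_set["ID"].isin(aff_set["Probeset"])].reset_index(drop=True)

# Option 2: inner join on key columns that have different names in each frame.
merged = pd.merge(patient_set, aff_set, left_on="ID", right_on="Probeset", how="inner")

print(patients)
print(merged)
```

The isin filter is enough when only the patient rows are needed; the merge also carries over any extra columns from the second frame.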
