I am not very experienced with programming and need some help solving a problem.
I have a .csv with 4 columns and about 5k rows, filled with questions and answers, and I want to find word collocations in each cell.
Starting point: a Pandas dataframe with 4 columns and about 5k rows (Id, Title, Body, Body2).
Goal: a dataframe with 7 columns (Id, Title, Title-Collocations, Body, Body-Collocations, Body2, Body2-Collocations), where the collocation columns come from applying a function to each row.
I have found an example for bigram collocations in the NLTK documentation:
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
finder.apply_freq_filter(3)  # ignore bigrams that occur fewer than 3 times
print(finder.nbest(bigram_measures.pmi, 5))
>>> [('Beer', 'Lahai'), ('Lahai', 'Roi'), ('gray', 'hairs'), ('Most', 'High'), ('ewe', 'lambs')]
I want to adapt this function to my Pandas dataframe. I am aware of the apply function for Pandas dataframes but can't manage to get it to work. This is my test approach for one of the columns:
df['Body-Collocation'] = df.apply(lambda df: BigramCollocationFinder.from_words(df['Body']),axis=1)
but if I print that out for an example row, I get
print (df['Body-Collocation'][1])
>>> <nltk.collocations.BigramCollocationFinder object at 0x113c47ef0>
I am not even sure if this is the right way. Can someone point me in the right direction?
If you want to apply BigramCollocationFinder.from_words() to each value in the Body column, you'd have to do:
df['Body-Collocation'] = df.Body.apply(lambda x: BigramCollocationFinder.from_words(x))
In essence, apply allows you to loop through the rows and provide the corresponding value of the Body column to the applied function.
But as suggested in the comments, providing a data sample would make it easier to address your specific case.
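One caveat worth noting: from_words() expects an iterable of tokens, so if Body holds raw strings, it will iterate over individual characters rather than words. A minimal sketch in case the text still needs tokenizing (word_tokenize is one option here, not necessarily what the asker used):

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import word_tokenize

bigram_measures = nltk.collocations.BigramAssocMeasures()

# Tokenize first so from_words() sees words, not characters.
df['Body-Collocation'] = df['Body'].apply(
    lambda text: BigramCollocationFinder.from_words(word_tokenize(str(text)))
)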
Thanks for the answer. I guess the question I asked was not perfectly phrased, but your answer still helped me find a solution. Sometimes it's good to take a short break :-)
In case someone is interested in the answer, this worked out for me:
df['Body-Collocation'] = df.apply(lambda row: BigramCollocationFinder.from_words(row['Question-Tok']), axis=1)
df['Body-Collocation'] = df['Body-Collocation'].apply(lambda finder: finder.nbest(bigram_measures.pmi, 3))
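For what it's worth, the two steps can also be collapsed into a single apply (a sketch; 'Question-Tok' is assumed to hold pre-tokenized text, as in the solution above):

df['Body-Collocation'] = df['Question-Tok'].apply(
    lambda tokens: BigramCollocationFinder.from_words(tokens).nbest(bigram_measures.pmi, 3)
)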
I've used the code below to search across all columns of my dataframe to see if each row has the word "pool" and the words "slide" or "waterslide".
AR11_regex = r"""
(?=.*(?:slide|waterslide)).*pool
"""
f = lambda x: x.str.findall(AR11_regex, flags=re.VERBOSE|re.IGNORECASE)
d['AR']['AR11'] = d['AR'].astype(str).apply(f).any(1).astype(int)
This has worked fine, but when I write a for loop to do this for more than one regex pattern (e.g., AR11, AR12, AR21) using the code below, the new columns are all zeros (i.e., the search is not finding any hits):
for i in AR_list:
    print(i)
    pat = i+"_regex"
    print(pat)
    f = lambda x: x.str.findall(i+"_regex", flags=re.VERBOSE|re.IGNORECASE)
    d['AR'][str(i)] = d['AR'].astype(str).apply(f).any(1).astype(int)
Any advice on why this loop didn't work would be much appreciated!
A small sample data frame would help in understanding your question. In any case, your code sample has a multitude of problems.
i+"_regex" is just the string "AR11_regex". It won't evaluate to the value of the variable with the identifier AR11_regex. Put your regex patterns in a dict.
d['AR'] is the values in the AR column. It seems like you expect it to be a row.
d['AR'][str(i)] is adding a new row. It seems like you want to add a new column.
Lastly, this approach to setting a cell generally (always, for me) yields the following warning:
/var/folders/zj/pnrcbb6n01z2qv1gmsk70b_m0000gn/T/ipykernel_13985/876572204.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The suggested approach would be to use .at, as in d.at[str(i), 'AR'] or some such.
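For illustration, a minimal sketch of the dict-based fix (the column name, sample data, and second pattern are assumptions, not taken from the original post; str.contains is used instead of findall since the goal is just a hit/no-hit flag):

import re
import pandas as pd

d = pd.DataFrame({"desc": ["pool with a waterslide", "just a sauna"]})

patterns = {
    "AR11": r"(?=.*(?:slide|waterslide)).*pool",   # original pattern
    "AR12": r"(?=.*(?:swim|lap)).*pool",           # hypothetical second pattern
}

for name, pat in patterns.items():
    # Look the pattern up by value instead of trying to build a variable name.
    f = lambda x: x.str.contains(pat, flags=re.IGNORECASE, regex=True)
    # Assign a whole new column rather than indexing into d['AR'].
    d[name] = d.astype(str).apply(f).any(axis=1).astype(int)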
Add a sample data frame and refine your question for more suggestions.
First of all, I'm quite new to programming overall (< 2 months), so I'm sorry if this is a 'simple, no need to ask for help, try it yourself until you get it done' problem.
I have two dataframes with partially the same content: a general overview of mobile numbers, including their cost centers in the company, and monthly invoices with the affected mobile numbers and their invoice amounts.
I'd like to compare the content of the 'mobile-numbers' column of the monthly invoices DF to the content of the 'mobile-numbers' column of the general overview DF and, if they match, assign the respective cost center to the mobile number in the monthly invoices DF.
I'd love to share my code with you, but unfortunately I have absolutely no clue how to approach this problem.
Thanks
Edit: I'm from Germany and tried my best to explain the problem in English. If there is anything I messed up (so you don't get it), just tell me :)
Example of desired result
See if this program meets your needs. In the second dataframe I put the value '40' to demonstrate that cells that are already filled will not be zeroed out; the replacement only occurs if there is a matching value between the dataframes. If you want a better explanation of the program, comment below, and don't forget to vote and mark as solved. I also put in some prints for a better view, but in general they are not necessary.
import pandas as pd

general_df = pd.DataFrame({"mobile_number": [1234,3456,6545,4534,9874],
                           "cost_center": ['23F','67F','32W','42W','98W']})
invoice_df = pd.DataFrame({"mobile_number": [4534,5567,1234,4871,1298],
                           "invoice_amount": ['19,99E','19,99E','19,99E','19,99E','19,99E'],
                           "cost_center": ['','','','','40']})

print(f"""GENERAL OVERVIEW DF
{general_df}
________________________________________
INVOICE DF
{invoice_df}
_________________________________________
INVOICE RESULT
""")

def func(line):
    # Look up this invoice's mobile number in the general overview.
    match = general_df.loc[general_df['mobile_number'] == line['mobile_number']]
    if match.empty:
        # No match: keep whatever cost center the invoice already has.
        return line['cost_center']
    # Match found: take the cost center from the general overview.
    return match['cost_center'].values[0]

invoice_df['cost_center'] = invoice_df.apply(func, axis=1)
print(invoice_df)
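For reference, the same lookup can also be done without a row-wise apply by using Series.map (a sketch under the same column names as above):

# Build a mobile_number -> cost_center lookup from the general overview,
# then fill in only the invoices whose number was found; existing values
# (like the '40') survive the fillna.
mapping = general_df.set_index('mobile_number')['cost_center']
invoice_df['cost_center'] = (invoice_df['mobile_number'].map(mapping)
                             .fillna(invoice_df['cost_center']))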
I'm sorting a dataframe containing stock capitalisations from largest to smallest row-wise (I will compute the ratio of top 10 stocks vs the whole market as a proxy for concentration).
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='last').to_numpy(), index=stock_mv.columns)
stock_mv = stock_mv.apply(f, axis=1)
When I do this, however, the column names (tickers) no longer make sense. I read somewhere that you shouldn't delete column names or have them set to the same thing.
What is the best practice thing to do in this situation?
Thank you very much - I am very much a novice coder.
If I understand your problem right, you want to sort a dataframe row-wise. If that's the case, try this:
stock_mv = stock_mv.sort_values(axis=1, ascending=False)
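One caution: DataFrame.sort_values requires a by argument (a row or column label), so the one-liner above raises a TypeError as written, and it would in any case sort all rows by a single row's values rather than each row independently. A minimal NumPy-based sketch that sorts every row in descending order with NaNs kept last (assuming stock_mv is numeric):

import numpy as np
import pandas as pd

# Negate, sort ascending (NaNs go last), negate back: descending, NaNs last.
sorted_values = -np.sort(-stock_mv.to_numpy(), axis=1)

# After a row-wise sort the columns no longer correspond to any one ticker,
# so positional rank labels sidestep the column-name question.
stock_mv_sorted = pd.DataFrame(
    sorted_values,
    index=stock_mv.index,
    columns=[f"rank_{i}" for i in range(1, stock_mv.shape[1] + 1)],
)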
I am currently working on a project where my goal is to get the game scores for each NCAA men's basketball game. In order to do this, I need to use the python package sportsreference. I need to use two dataframes: one called df, which has the game date, and one called box_index (shown below), which has the unique link of each game. I need to get the date column replaced by the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which makes it easier to do this with regex or .contains(). I keep getting a KeyError: 0 error. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    #print(box)

    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i, "boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i, "date"] = box.loc[i, "boxscore_index"]

get_team_schedule("Virginia")
It seems like "box" and "df" are pandas dataframes, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of searching by index with .loc):
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        # Plain Python strings have no .contains(); use the "in" operator instead.
        if row_df["date"] in row_box["boxscore_index"]:
            df.at[i, 'date'] = row_box["boxscore_index"]
the ".at" function will overwrite the value at a given cell
Just FYI, iterrows is more efficient than .loc; however, itertuples is about 10x faster, and zip about 100x.
The KeyError: 0 is saying you can't get the row at index 0, because there is no index value of 0 when you use box.loc[i, "boxscore_index"] (the index values are the dates, for example '2020-12-22-14-virginia'). You could use .iloc though, like box.iloc[i]["boxscore_index"]. You'd have to convert all the .loc calls to that.
Like the other post said, though, I wouldn't go down that path. I actually wouldn't even use iterrows here. I would put the box_index into a list, then iterate through that, and use pandas to filter your df dataframe. I'm making some assumptions about what df looks like, so if this doesn't work or isn't what you're looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])

    for box_index in box_index_list:
        temp_game_data = df[df["date"] == box_index]
        print(box_index)
        print(temp_game_data, '\n')

get_team_schedule("Virginia")
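Since the links only start with the date rather than equaling it, a prefix-based lookup may be closer to the stated goal of replacing each date with its link. A sketch, assuming box_index_list is in scope and df['date'] values are string prefixes of the links:

def find_link(date, links):
    # Return the first boxscore link that starts with this date,
    # or keep the date unchanged if nothing matches.
    return next((link for link in links if link.startswith(date)), date)

df["date"] = df["date"].apply(lambda d: find_link(d, box_index_list))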
I have posted a similar question before, but I thought it'd be better to elaborate on it in another way. For example, I have a dataframe of compounds assigned to a number, as follows:
compound,number
17alpha_beta_SID_24898755,8
2_prolinal_109328,3
4_chloro_4491,37
5HT_144234_01,87
5HT_144234_02,2
6-OHDA_153466,23
There is also another dataframe with other properties as well as the compound names, but not only with their corresponding numbers: there are rows in which the compound names are assigned to different numbers, and these cases where there are differences are not of interest:
rmsd,chemplp,plp,compound,number
1.00,14.00,-25.00,17alpha_beta_SID_24898755,7
0.38,12.00,-19.00,17alpha_beta_SID_24898755,8
0.66,16.00,-25.6,17alpha_beta_SID_24898755,9
0.87,24.58,-38.35,2_prolinal_109328,3
0.17,54.58,-39.32,2_prolinal_109328,4
0.22,22.58,-32.35,2_prolinal_109328,5
0.41,45.32,-37.90,4_chloro_4491,37
0.11,15.32,-37.10,4_chloro_4491,38
0.11,15.32,-17.90,4_chloro_4491,39
0.61,38.10,-45.86,5HT_144234_01,85
0.62,18.10,-15.86,5HT_144234_01,86
0.64,28.10,-25.86,5HT_144234_01,87
0.64,16.81,-10.87,5HT_144234_02,2
0.14,16.11,-10.17,5HT_144234_02,3
0.14,16.21,-10.17,5HT_144234_02,4
0.15,31.85,-24.23,6-OHDA_153466,23
0.13,21.85,-34.23,6-OHDA_153466,24
0.11,11.85,-54.23,6-OHDA_153466,25
The problem is that I want to find each compound and its corresponding number from dataframe 1 in dataframe 2 and return its entire row.
I was only able to do this (but due to the way the iteration goes in this case, it doesn't work for what I intend):
import pandas as pd

for c1, n1, c2, n2 in zip(df1.compound, df1.number, df2.compound, df2.number):
    if c1 == c2 and n1 == n2:
        print(df2[*])  # pseudocode: print the entire matching row here
I wanted to print the entire row in which c1==c2 and n1==n2.
Example: for 17alpha_beta_SID_24898755 (compound) and 8 (its number) in dataframe 1, return the row in which this compound and this number are found in dataframe 2. The result should be:
0.38,12.00,-19.00,17alpha_beta_SID_24898755,8
I'd like to do this for all the compounds and their corresponding numbers from dataframe 1. The example I gave was only a small set from an extremely extensive list. If anyone could help, thank you!
Take a look at the df.merge method:
df1.merge(df2, on=['compound', 'number'], how='inner')
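Applied to the sample data above, this returns only the rows of dataframe 2 whose (compound, number) pair also appears in dataframe 1. A sketch, assuming the two CSV blocks shown have been saved as compounds.csv and properties.csv (hypothetical filenames):

import pandas as pd

df1 = pd.read_csv("compounds.csv")    # columns: compound, number
df2 = pd.read_csv("properties.csv")   # columns: rmsd, chemplp, plp, compound, number

# Inner join keeps only (compound, number) pairs present in both frames.
matched = df1.merge(df2, on=["compound", "number"], how="inner")
print(matched)
# The 17alpha_beta_SID_24898755 / 8 pair comes back with
# rmsd=0.38, chemplp=12.00, plp=-19.00, as desired.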