I wanted to sort a DataFrame by the length of the messages in a column, without adding a new column. I tried the method below and it didn't work. Is there any way to sort values based on a custom function?
df.sort_values(df['message'].apply(len),ascending=False)
Regards,
Michael
You can use the length of the string (message) via the key parameter of sort_values().
Consider a random df:
df = pd.DataFrame({'messages':['come here please as I need you','why would i come there','fine i will be there soon']})
df
messages
0 come here please as I need you
1 why would i come there
2 fine i will be there soon
Use:
df.sort_values(by='messages', key=lambda x: x.str.len(), ascending=False, inplace=True)
df
messages
0 come here please as I need you
2 fine i will be there soon
1 why would i come there
So you were almost there: by expects column labels rather than a precomputed Series, and key is how you transform that column before sorting.
For more information on the parameters of sort_values, check this link.
import pandas as pd

l1 = ['ab', 'effg', 'hjj', 'klllh', 'm', 'n', 'abbc']
df = pd.DataFrame(data={'message': l1})
df['message'] = list(df['message'][df['message'].apply(len).sort_values(ascending=False).index])
print(df)
Get the sorted index and apply it to the message column.
(Screenshot of the sorted output: https://i.stack.imgur.com/iyxzL.png)
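The same idea can also be written by reindexing the whole frame with the sorted index instead of reassigning the column; a sketch using the same sample data:

import pandas as pd

l1 = ['ab', 'effg', 'hjj', 'klllh', 'm', 'n', 'abbc']
df = pd.DataFrame({'message': l1})

# Sort the index by message length (descending), then reorder the frame
order = df['message'].str.len().sort_values(ascending=False).index
df = df.loc[order].reset_index(drop=True)
print(df)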
I can't find a way to loop over my DataFrame (df_yf), extract all the "Close" prices, and create a new df_adj. The df is grouped by coin price.
Initially, I tried something like the following, but it throws an error:
for i in range(len(df_yf.columns)):
    df_adj.append(df_yf[i]["Close"])
I also tried using .get and .filter, but they throw errors such as:
"list indices must be integers or slices, not str; perhaps you missed
a comma?"
EDIT!!
Thank you for the answers. They made me realize my mistake :D. I shouldn't group by tickers, so I changed it to group by prices (Low, Close, etc.) and was then able to simply extract the right columns with df_adj = df_yf["Close"], as was mentioned.
df_adj = np.array(df_yf["Close"])
A DataFrame extracts columns dict-style, and .values then gives the ndarray form:
df_adj = df_yf["Close"].values
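For example, a minimal sketch with a made-up frame standing in for df_yf:

import pandas as pd

# Made-up stand-in for df_yf with prices grouped by field
df_yf = pd.DataFrame({'Close': [1.0, 3.0, 2.0], 'Open': [0.9, 2.8, 1.9]})

df_adj = df_yf["Close"].values  # dict-style column lookup, then ndarray via .values
print(df_adj)  # [1. 3. 2.]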
If you group by Tickers, you could use:
df_adj = pd.DataFrame()
for i in [ticker[0] for ticker in df_yf]:
    df_adj[i] = df_yf[i]['Close']
Result:
   Ticker1  Ticker2  Ticker3
0        1        1        1
1        3        3        3
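As a self-contained sketch (the MultiIndex frame below is made up to mimic yfinance's group_by='ticker' layout, where each column label is a (ticker, field) tuple):

import pandas as pd

# Hypothetical frame mimicking group_by='ticker' output:
# column labels are (ticker, field) tuples
cols = pd.MultiIndex.from_product([['Ticker1', 'Ticker2', 'Ticker3'],
                                   ['Open', 'Close']])
df_yf = pd.DataFrame([[0, 1, 2, 1, 4, 1],
                      [2, 3, 2, 3, 6, 3]], columns=cols)

# Iterating a DataFrame yields its column labels, so ticker[0]
# extracts the ticker name from each (ticker, field) tuple
df_adj = pd.DataFrame()
for i in [ticker[0] for ticker in df_yf]:
    df_adj[i] = df_yf[i]['Close']
print(df_adj)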
I am currently working on a project where my goal is to get the game scores for each NCAA men's basketball game. In order to do this, I need to use the Python package sportsreference. I need to use two DataFrames: one called df, which has the game date, and one called box_index (shown below), which has the unique link of each game. I need to get the date column replaced by the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which makes it easier to do this with regex or .contains(). I keep getting a KeyError: 0 error. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    #print(box)
    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i, "boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i, "date"] = box.loc[i, "boxscore_index"]

get_team_schedule("Virginia")
It seems like "box" and "df" are pandas data frame, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of searching by index with ".loc")
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        if row_df["date"] in row_box["boxscore_index"]:
            df.at[i, 'date'] = row_box["boxscore_index"]
the ".at" function will overwrite the value at a given cell
Just FYI, iterrows is more efficient than repeated .loc lookups; however, itertuples is about 10x faster, and zip about 100x.
The KeyError: 0 error is saying you can't get the row at index 0, because there is no index value of 0 for box.loc[i, "boxscore_index"] (the index values are the dates, for example '2020-12-22-14-virginia'). You could use .iloc instead, like box.iloc[i]["boxscore_index"], but you'd have to convert all the .loc calls to that.
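A minimal sketch of that failure mode (box here is a made-up stand-in indexed by boxscore strings):

import pandas as pd

# Made-up stand-in: the index holds boxscore strings, not integers
box = pd.DataFrame({'boxscore_index': ['2020-12-22-14-virginia']},
                   index=['2020-12-22-14-virginia'])

# box.loc[0, 'boxscore_index']        # raises KeyError: 0 (no label 0)
print(box.iloc[0]['boxscore_index'])  # positional access works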
Like the other post said, though, I wouldn't go down that path. I actually wouldn't even use iterrows here. I would put the box_index values into a list, then iterate through that, and use pandas to filter your df DataFrame. I'm making some assumptions about what df looks like, so if this doesn't work, or it isn't what you're looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])
    for box_index in box_index_list:
        temp_game_data = df[df["date"] == box_index]
        print(box_index)
        print(temp_game_data, '\n')

get_team_schedule("Virginia")
I have a Pandas dataframe containing tweets. I want to count the number of tweets that have been retweeted.
This code does not work
tweets_retweeted = twitter.apply(lambda x:True if x.retweet_count > 0 else False)
count_of_tweets_retweeted = len(tweets_retweeted[tweets_retweeted == True].index)
The error message I get is
KeyError: ('retweet_count', 'occurred at index created_at')
Without having the ability to recreate your example, there are a few things that could be going on.
The error is likely coming from the 1st line where you are trying to access the column.
You may be passing one column at a time to the apply function rather than one row at a time. Use axis=1 so each row is passed instead, and see if that works.
Also, just a best practice (in my humble opinion) is to not reference column names with the dot notation. Try to use the bracket notation to differentiate between column names and methods.
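For example, a sketch with a made-up frame (using bracket notation, per the note above):

import pandas as pd

# Made-up stand-in for the 'twitter' frame in the question
twitter = pd.DataFrame({'retweet_count': [0, 3, 1, 0]})

# axis=1 passes one row at a time, so x['retweet_count'] is a scalar
tweets_retweeted = twitter.apply(lambda x: x['retweet_count'] > 0, axis=1)
count_of_tweets_retweeted = tweets_retweeted.sum()  # booleans sum as 0/1
print(count_of_tweets_retweeted)  # 2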
Can you do:
j = twitter['retweet_count'] > 0
j.value_counts()
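Since True counts as 1 when summed, j.sum() also gives the number of retweeted tweets directly.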
I am new to Python and I'm trying to produce a result similar to Excel's IndexMatch function with Python & Pandas, though I'm struggling to get it working.
Basically, I have 2 separate DataFrames:
The first DataFrame ('market') has 7 columns, though I only need 3 of those columns for this exercise ('symbol', 'date', 'close'). This df has 13,948,340 rows.
The second DataFrame ('transactions') has 14 columns, though I only need 2 of those columns ('i_symbol', 'acceptance_date'). This df has 1,428,026 rows.
My logic is: If i_symbol is equal to symbol and acceptance_date is equal to date: print symbol, date & close. This should be easy.
I have achieved it with iterrows() but because of the size of the dataset, it returns a single result every 3 minutes - which means I would have to run the script for 1,190 hours to get the final result.
Based on what I have read online, itertuples should be a faster approach, but I am currently getting an error:
ValueError: too many values to unpack (expected 2)
This is the code I have written (which currently produces the above ValueError):
for i_symbol, acceptance_date in transactions.itertuples(index=False):
    for symbol, date in market.itertuples(index=False):
        if i_symbol == symbol and acceptance_date == date:
            print(market.symbol + market.date + market.close)
2 questions:
Is itertuples() the best/fastest approach? If so, how can I get the above working?
Does anyone know a better way? Would indexing work? Should I use an external db (e.g. mysql) instead?
Thanks, Matt
Regarding question 1: DataFrame.itertuples() yields one namedtuple for each row. You can either unpack these like standard tuples or access the tuple elements by name:
for t in transactions.itertuples(index=False):
    for m in market.itertuples(index=False):
        if t.i_symbol == m.symbol and t.acceptance_date == m.date:
            print(m.symbol + m.date + m.close)
(I did not test this with data frames of your size but I'm pretty sure it's still painfully slow)
Regarding question 2: You can simply merge both data frames on symbol and date.
Rename your "transactions" DataFrame so that it also has columns named "symbol" and "date":
transactions = transactions[['i_symbol', 'acceptance_date']]
transactions.columns = ['symbol','date']
Then merge both DataFrames on symbol and date:
result = pd.merge(market, transactions, on=['symbol','date'])
The result DataFrame consists of one row for each symbol/date combination which exists in both DataFrames. The operation only takes a few seconds on my machine with DataFrames of your size.
#Parfait provided the best answer below as a comment. Very clean, worked incredibly fast - thank you.
pd.merge(market[['symbol', 'date', 'close']],
         transactions[['i_symbol', 'acceptance_date']],
         left_on=['symbol', 'date'],
         right_on=['i_symbol', 'acceptance_date'])
No need for looping.
I have some data that I want to analyze. I group my data by the relevant group variables (here, 'test_condition' and 'region') and analyze the measure variable ('rt') with a function I wrote:
grouped = data.groupby(['test_condition', 'region'])['rt'].apply(summarize)
That works fine. The output looks like this (fake data):
                                      ci1         ci2        mean
test_condition      region
Test Condition Name And     0  295.055978  338.857066  316.956522
                    Spill1  0  296.210167  357.036210  326.623188
                    Spill2  0  292.955327  329.435977  311.195652
The problem is, 'test_condition' and 'region' are not actual columns, I can't index into them. I just want columns with the names of the group variables! This seems so simple (and is automatically done in R's ddply) but after lots of googling I have come up with nothing. Does anyone have a simple solution?
By default, the grouping variables are turned into an index. You can change the index to columns with grouped.reset_index().
My second suggestion, specifying this in the groupby call with as_index=False, seems not to work as desired in this case with apply (but it does work when using aggregate).
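A minimal sketch of both options, using mean as a stand-in for the custom summarize function:

import pandas as pd

# Made-up data standing in for the question's frame
data = pd.DataFrame({'test_condition': ['A', 'A', 'B', 'B'],
                     'region': ['r1', 'r2', 'r1', 'r2'],
                     'rt': [300.0, 320.0, 310.0, 330.0]})

# Option 1: turn the grouping levels back into ordinary columns
grouped = data.groupby(['test_condition', 'region'])['rt'].mean().reset_index()
print(grouped.columns.tolist())  # ['test_condition', 'region', 'rt']

# Option 2: as_index=False works here because mean is an aggregation
flat = data.groupby(['test_condition', 'region'], as_index=False)['rt'].mean()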