I have a CSV data file where each row contains recordID, duration, src, and dst.
I want to label each row (in a new column) with either a 0 or a 1, depending on the output of my algorithm.
I'm currently doing something like this; however, when I write the DataFrame out to a CSV file, it drops all the other, existing columns.
Another issue is that this solution is extraordinarily slow. I thought of building a simple array of labels first and then adding that entire array as a new column, but I don't know how to do that either.
df2 = pd.read_csv(f_path2, names=["record ID", "duration_", "src_bytes", "dst_bytes", "label"], header=None)
df2 = df2.dropna()
df2.head()
for source, dest, label in X_test_scaled:
    predict = kmeans.predict([[source, dest]])
    df2.at[total, 'label'] = predict  # total as index
How do I do this correctly - update my existing file without losing the other columns, and faster?
This is a guess since it is not really clear what your data looks like. But it seems that running kmeans.predict for the entire list at once might speed things up. You could then assign the list of predictions to a column in your dataframe:
df2['label'] = kmeans.predict([[source, dest] for source, dest, label in X_test_scaled])
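For completeness, a rough end-to-end sketch of that idea with the write-back included, so the other columns are preserved. It assumes X_test_scaled is an array of (src, dst, label) triples aligned with df2's rows, kmeans is an already-fitted scikit-learn model, and "labeled_output.csv" is a hypothetical output path:
import numpy as np

# Predict every row in one call instead of once per row.
features = np.asarray(X_test_scaled)[:, :2]   # keep only the src/dst columns
df2['label'] = kmeans.predict(features)

# Writing the whole DataFrame back out keeps every existing column.
df2.to_csv("labeled_output.csv", index=False)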
Your question isn't precise enough to give a complete solution, but here is what I can conclude from that info:
You can use apply() together with loc.
With loc you have access to every row - it works like an iterator over all rows.
Inside predictorFunction you can return anything based on the other columns of the row (in this case, just run your predictor).
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
def predictorFunction(currentRow):
    print(currentRow["record ID"])
    # based on any other columns of the row you can return whatever you need -
    # in this case just run your predictor:
    return kmeans.predict([[currentRow["columnNameA"], currentRow["columnNameB"]]])[0]

df2['Predict'] = df2.apply(predictorFunction, axis=1)
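Note that apply() with axis=1 still calls the predictor once per row under the hood, so for a large file the single vectorized kmeans.predict over the whole dataset (as in the other answer) will generally be much faster.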
I want to put the std and mean of a specific column of a dataframe, for different hours, into a new dataframe. (The data comes from analyses conducted on big data in multiple Excel files.)
I use a for-loop and append(), but it only keeps the last values, not all of them.
Here is my code:
hh = ['01:00','02:00','03:00','04:00','05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  ## it works correctly, reads individual Excel spreadsheet
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
    final.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean},ignore_index=True)
I am not sure, but I believe you should assign the result of final.append(...) to a variable:
final = final.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)
Update
If time efficiency is of interest to you, it is better to collect your desired row dicts ({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}) in a list and build the dataframe from that list, which is said to have better performance. (Thanks to @stefan_aus_hannover)
This is what I am referring to in the comments on Amirhossein's answer:
hh=['01:00','02:00','03:00','04:00','05:00']
lister = []
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  ## it works correctly
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    lister.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean})
final = final.append(pd.DataFrame(lister), ignore_index=True)
Conceptually you're just doing an aggregate by hour with the two functions std and mean, then appending that to your result dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note that .agg/.aggregate() lets you apply multiple aggregating functions at once and name their result columns directly, so there's no need to declare temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), then there's no need to read in columns 0..3.
for hour in hh:
    # Read in the columns of interest from the individual Excel sheet for this month and hour...
    data = get_data(1, hour)
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    # Compute the corresponding row of the aggregate...
    agg = data['Total Load (MWh)'].agg(['std', 'mean'])
    dat_hh_aggregate = pd.DataFrame([{'Month': whatever,  # fill in whichever month value you want
                                      'Hour': hour,
                                      'standard deviation': agg['std'],
                                      'average': agg['mean']}])
    final = final.append(dat_hh_aggregate, ignore_index=True)
Notes:
pd.read_excel's usecols=['Flowday','Interval',...] argument lets you avoid reading in columns you aren't interested in in the first place (see the sketch after these notes). You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass in the list of columns-of-interest. You seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
There's no need to store the separate local variables s_td and meean; just use .aggregate() directly.
There's no need to have both lister and final. Just have one results dataframe final, and append to it, ignoring the index. (If you run into issues with that, post updated code here and make sure it's reproducible.)
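For illustration, a minimal sketch of the usecols idea, assuming get_data() boils down to a pd.read_excel call; the file-naming scheme here is purely hypothetical:
import pandas as pd

def get_data(month, hour, columns=('Total Load (MWh)',)):
    # Hypothetical file name; adjust to however your spreadsheets are actually named.
    path = f"load_data_month{month:02d}_{hour.replace(':', '')}.xlsx"
    # usecols means pandas never reads the columns you don't care about.
    return pd.read_excel(path, usecols=list(columns))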
I am currently working on a project where my goal is to get the game scores for each NCAA men's basketball game. In order to do this, I need to use the Python package sportsreference. I need to use two dataframes: one called df, which has the game date, and one called box_index (shown below), which has the unique link of each game. I need to get the date column replaced by the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which makes it easier to do this with regex or .contains(). I keep getting a KeyError: 0 error. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule
def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    #print(box)

    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i,"boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i,"date"] = box.loc[i,"boxscore_index"]

get_team_schedule("Virginia")
It seems like "box" and "df" are pandas data frame, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of searching by index with ".loc")
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        if row_df["date"] in row_box["boxscore_index"]:
            df.at[i, 'date'] = row_box["boxscore_index"]
the ".at" function will overwrite the value at a given cell
Just FYI, iterrows is more efficient than .loc; however, itertuples is about 10x faster, and zip about 100x.
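For reference, a rough itertuples version of the nested loop above might look like this (same column names as in the question; the substring check is written with Python's in operator):
# Sketch only: itertuples exposes each row as a namedtuple, so columns become attributes.
for row_df in df.itertuples():
    for row_box in box.itertuples():
        if row_df.date in row_box.boxscore_index:
            df.at[row_df.Index, 'date'] = row_box.boxscore_index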
The KeyError: 0 is saying you can't get the row at index 0, because there is no index value of 0 when using box.loc[i, "boxscore_index"] (the index values are strings like '2020-12-22-14-virginia'). You could use .iloc though, like box.iloc[i]["boxscore_index"]. You'd have to convert all the .loc calls to that.
Like the other post said though, I wouldn't go that path. I actually wouldn't even use iterrows here. I would put the box_index into a list, then iterate through that. Then use pandas to filter your df dataframe. I'm sort of making some assumptions about what df looks like, so if this doesn't work, or isn't what you're looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule
def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])

    for box_index in box_index_list:
        temp_game_data = df[df["date"] == box_index]
        print(box_index)
        print(temp_game_data, '\n')

get_team_schedule("Virginia")
I'm trying to build search-based results, where I have an input dataframe with one row and I want to compare it with another dataframe that has almost 1 million rows. I'm using a package called Record Linkage.
However, I'm not able to handle typos. Let's say I have "HSBC" in my original data and the user types it as "HKSBC"; I want to return "HSBC" results only. When comparing the string similarity with Jaro-Winkler I get the following result:
from pyjarowinkler import distance
distance.get_jaro_distance("hksbc", "hsbc", winkler=True, scaling=0.1)
>> 0.94
However, I'm not able to return "HSBC" as an output, so I want to create a new column in my pandas dataframe in which I'll compute the string similarity scores and keep the rows whose score is above a particular threshold.
Also, the main bottleneck is that I have almost 1 million rows, so I need to compute this really fast.
P.S. I have no intention of using fuzzywuzzy; I'd prefer either Jaccard or Jaro-Winkler.
P.P.S. Any other ideas for handling typos in a search-based setting are also welcome.
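For what it's worth, the brute-force version of the score column described above could be sketched like this, reusing the pyjarowinkler call from the question (the 'Name' column and the 0.9 threshold are assumptions); the self-answer below instead uses recordlinkage's candidate indexing to avoid scoring all 1 million rows:
from pyjarowinkler import distance

query = "hksbc"  # the user's (possibly misspelled) input

# Score every candidate name, keep only the close matches.
df['score'] = df['Name'].apply(
    lambda name: distance.get_jaro_distance(query.lower(), name.lower(), winkler=True, scaling=0.1)
)
matches = df[df['score'] > 0.9].sort_values('score', ascending=False)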
I was able to solve it through Record Linkage only. Basically it does an initial indexing and generates candidate links (you can refer to the documentation on "SortedNeighbourhood indexing" for more info), i.e. it builds a multi-index between the two dataframes that need to be compared, which I did manually.
So here is my code:
import recordlinkage
df['index'] = 1 # this will be static since I'll have only one input value
df['index_2'] = range(1, len(df)+1)
df.set_index(['index', 'index_2'], inplace=True)
candidate_links=df.index
df.reset_index(drop=True, inplace=True)
df.index = range(1, len(df)+1)
# once the candidate links have been generated, you need to reset the index and compare with the input dataframe, which basically has only one static index, i.e. 1
compare_cl = recordlinkage.Compare()
compare_cl.string('Name', 'Name', label='Name', method='jarowinkler')  # 'Name' is the column name which is present in both dataframes
features = compare_cl.compute(candidate_links,df_input,df) # df_input is the i/p df having only one index value since it will always have only one row
print(features)
                   Name
index index_2
1     13446    0.494444
      13447    0.420833
      13469    0.517949
Now I can give a filter like this:
features = features[features['Name'] > 0.9] # setting the threshold which will filter away my not-so-close names.
Then,
df = df[df.index.isin(features.index.get_level_values('index_2'))]
This will filter my results and give me the final dataframe containing only the names whose score is greater than the threshold set by the user.
I've been searching for a solution to this for a while, and I'm really stuck! I have a very large text file, imported as a pandas dataframe, containing just two columns but hundreds of thousands to millions of rows. The columns contain packet dumps: one is the packet data, formatted as ASCII representations of monotonically increasing integers, and the second is the packet time.
I want to go through this dataframe, make sure that it is monotonically increasing, and, if there are missing data, insert new rows to make the list monotonically increasing, i.e. the 'data' column should be filled in with the appropriate value but the time should be set to 'NaN' or 'NULL', etc.
The following is a sample of the data:
data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303400 1527986052.506439335
So I have two questions:
1) I've been trying to loop through the dataframe using itertuples to get the next row, compare it with the current row, and add a new row if the difference is more than 100. Unfortunately I've struggled with this, since there doesn't seem to be a good way to retrieve the row after the current one.
2) Is there a better (faster) way to do this than the one I've proposed?
This may be trivial, though I've really struggled with it. Thank you in advance for your help.
One problem at a time. You can check monotonicity directly with df.data.is_monotonic_increasing.
Inserting new indices: it is better to go the other way around. You already know the index you want; it is given by range(min_val, max_val+1, 100). You can create a blank DataFrame with this index and update it using your data.
This may be memory intensive, so you may need to go over your data in chunks. In that case, you may need to provide the index range ahead of time (see the chunked sketch after the example).
import pandas as pd
from io import StringIO

# test data
df = pd.read_csv(
    StringIO(
"""data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
),
sep=r" +",
)
# check if the data is increasing
assert df.data.is_monotonic_increasing
# desired index range
rng = range(df.data.iloc[0], df.data.iloc[-1] + 1, 100)
# blank frame with full index
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])
# update with existing data
df2.update(df.set_index("data"))
# result
# frame_time_epoch
# 303030303030303000 1.52799e+09
# 303030303030303100 1.52799e+09
# 303030303030303200 1.52799e+09
# 303030303030303300 1.52799e+09
# 303030303030303400 NaN
# 303030303030303500 1.52799e+09
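And a rough sketch of the chunked variant mentioned above, assuming the dump sits in a hypothetical packets.csv file and that the overall index range is known ahead of time (values taken from the sample data):
import pandas as pd

# Full index built ahead of time from the known data range.
rng = range(303030303030303000, 303030303030303500 + 1, 100)
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])

# Read and fold in the dump piece by piece so it never has to fit in memory at once.
for chunk in pd.read_csv("packets.csv", sep=r" +", chunksize=100_000):
    df2.update(chunk.set_index("data"))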
Just for exploration: did you try something like
delta = df['data'].diff()
delta[delta>0]
delta[delta<100]
Say I construct a dataframe with pandas, having multi-indexed columns:
mi = pd.MultiIndex.from_product([['trial_1', 'trial_2', 'trial_3'], ['motor_neuron','afferent_neuron','interneuron'], ['time','voltage','calcium']])
ind = np.arange(1,11)
df = pd.DataFrame(np.random.randn(10,27),index=ind, columns=mi)
Link to image of output dataframe
Say I want only the voltage data from trial 1. I know that the following code fails, because the indices are not sorted lexically:
idx = pd.IndexSlice
df.loc[:,idx['trial_1',:,'voltage']]
As explained in another post, the solution is to sort the dataframe's indices, which works as expected:
dfSorted = df.sortlevel(axis=1)
dfSorted.loc[:,idx['trial_1',:,'voltage']]
I understand why this is necessary. However, say I want to add a new column:
dfSorted.loc[:,('trial_1','interneuron','scaledTime')] = 100 * dfSorted.loc[:,('trial_1','interneuron','time')]
Now dfSorted is not sorted anymore, since the new column was tacked onto the end, rather than snuggled into order. Again, I have to call sortlevel before selecting multiple columns.
I feel this makes for repetitive, bug-prone code, especially when adding lots of columns to the much bigger dataframe in my own project. Is there a (preferably clean-looking) way of inserting new columns in lexical order without having to call sortlevel over and over again?
One approach would be to use filter which does a text filter on the column names:
In [117]: df['trial_1'].filter(like='voltage')
Out[117]:
  motor_neuron afferent_neuron interneuron
       voltage         voltage     voltage
1    -0.548699        0.986121   -1.339783
2    -1.320589       -0.509410   -0.529686