I have a large dataset that I am reading with pandas. I am trying to split one of the columns into two sets based on a threshold and check whether there are any overlapping values between the two sets. With the code below I can tell that arrays 'b' and 'c' share some values, but I want to get those specific values and don't know how. Can anybody point me in the right direction?
import pandas as pd
import numpy as np

df = pd.read_csv('....csv')
df2 = df[df['Freq'] >= 280]
a = df2['Ring'].values
b = df2['Ring'].drop_duplicates().values
df3 = df[df['Freq'] <= 280]
df3['Ring'].values
c = df3['Ring'].drop_duplicates().values
if np.all(b) == np.all(c):
    print("They are overlapping")
else:
    print("They are not overlapping")
Based on the example provided, you can do the following:
import numpy as np
np.intersect1d(b, c)
Alternatively, you can do something like:
cond = df['Freq'] >= 280
np.intersect1d(df[cond]['Ring'], df[~cond]['Ring'])
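For illustration, here is a minimal, self-contained sketch with invented 'Freq' and 'Ring' data (the column names come from the question; the values are made up), showing that np.intersect1d returns exactly the values that appear on both sides of the split:

import numpy as np
import pandas as pd

# Made-up data for demonstration only
df = pd.DataFrame({'Freq': [250, 280, 300, 310, 200],
                   'Ring': ['R1', 'R2', 'R3', 'R4', 'R2']})

cond = df['Freq'] >= 280
print(np.intersect1d(df[cond]['Ring'], df[~cond]['Ring']))   # -> ['R2']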
I want to change all the cells in my DataFrame that are strings with the value ''.
I have one dataset that has 7 columns, for example:
a, b, c, d, e, f, g,
and it has 700 rows.
I want to change the value of the cells in 5 specific columns in one statement.
I tried this:
columns = ['a', 'b', 'c', 'd', 'e']

def get_tmp(i):
    if len(i) == 0:
        b = 'tmp'
        return b
    else:
        return i

weights_df[colun] = weights_df[colun].apply(get_tmp)
but this doesn't work.
To fix the problem I used a for loop:
columns = ['a', 'b', 'c', 'd', 'e']

def get_tmp(i):
    if len(i) == 0:
        b = 'tmp'
        return b
    else:
        return i

for colun in columns:
    weights_df[colun] = weights_df[colun].apply(get_tmp)

Is there another way to fix this situation using only .apply?
If so, do I need to change something in my function? What do I need to change?
Thank you, guys.
You can try this code:
import pandas as pd
import numpy as np

filename = 'Book1.xlsx'
weights_df = pd.read_excel(filename)

columns = ['a', 'b', 'c', 'd', 'e']
for col in columns:
    weights_df[col] = weights_df[col].apply(lambda x: 'tmp' if x == '' else x)

This code works on my machine.
Thank you Bright Gene, it's a good answer.
What I actually want to check is whether it is possible to make the same changes without the for loop.
Let me be clearer, since I may have miscommunicated.
This DataFrame has the columns a, b, c, d, e, f, g.
I want to change only a subset of these columns: a, b, c, d, e.
The others are numbers.
I created a list: columns_to_modify = ['a', 'b', 'c', 'd', 'e']
and I want to try to change them all at once, like this:
weights_df[columns_to_modify] = weights_df[columns_to_modify].apply(lambda x: 'tmp' if x=='' else x)
At this point I want to understand whether there is a way to apply this only to specific columns, without a for loop.
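One loop-free possibility (a sketch with invented data, not part of the original thread) is DataFrame.applymap, which applies a function element-wise to every cell of the selected columns; DataFrame.replace('', 'tmp') on the same subset would achieve the same result.

import pandas as pd

# Invented sample data mirroring the described layout: a-e hold strings, f-g hold numbers
weights_df = pd.DataFrame({'a': ['x', ''], 'b': ['', 'y'], 'c': ['z', 'z'],
                           'd': ['', ''], 'e': ['w', ''], 'f': [1, 2], 'g': [3, 4]})

columns_to_modify = ['a', 'b', 'c', 'd', 'e']
# applymap works element-wise on the whole sub-DataFrame, so no explicit loop is needed
weights_df[columns_to_modify] = weights_df[columns_to_modify].applymap(
    lambda x: 'tmp' if x == '' else x)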
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on the groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to derive my_df1_Group and my_df2_Group from my_df1 and my_df2 programmatically in Python (I know other languages where it would simply be name + "_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but working with Sample[1] directly in the for doesn't work
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the Bad_Row column to my DataFrame, nor does it modify the DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a single DataFrame (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet the specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
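As an additional, loop-free sketch (not from the original thread, reusing the my_df1 and my_df1_Group frames defined above): the per-group maximum can be computed once with groupby, and the offending groups dropped with isin.

my_max = 16
# Maximum element Value per group, as a Series indexed by Group
max_per_group = my_df1.groupby('Group')['Value'].max()
bad_groups = max_per_group[max_per_group > my_max].index
# Keep only the groups whose elements all stay within the threshold
my_df1_Group_clean = my_df1_Group[~my_df1_Group['Group'].isin(bad_groups)]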
My program contains many different NumPy arrays, with various data inside each of them. An example of an array is:
x = [[5, 'ADC01', Input1, 25000],   # columns: [TypeID, Type, Input, Counts]
     [5, 'ADC01', Input2, 40000]]
From separate arrays I can retrieve the value of Type and Input. I then need to say
Counts = x[0,3] where Type = 'ADC01' and Input = 'Input2'
Obviously it would not be written like this. For the times when I only needed to satisfy one condition, I have used:
InstType_ID = int(InstInv_Data[InstInv_Data[:,glo.inv_InstanceName] == Instrument_Type_L][0,glo.inv_TypeID])
Here, it looks in the array InstInv_Data at the 'InstanceName' column and finds a match to Instrument_Type. It then assigns the 'TypeID' column to InstType_ID. I basically want to add an AND condition so it also looks for another matching piece of data in another column.
Edit: I just thought that I could try to do this in two separate steps: return both the Input and Counts columns where the Type column matches Type. However, I am unsure how to actually return two columns instead of a single one. Something like this:
Intermediate_Counts = InstDef_Data[InstDef_Data[:, glo.i_Type] == Instrument_Type_L][0, (glo.i_Input, glo.i_Counts)]
You could use a & b to perform element-wise AND for two boolean arrays a, b:
selected_rows = x[(x[:,1] == 'ADC01') & (x[:,2] == 'Input2')]
Similarly, use a | b for OR and ~a for NOT.
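For illustration, here is a minimal, self-contained sketch with invented sample data (the column layout follows the question; the quoted Input strings are assumptions):

import numpy as np

# [TypeID, Type, Input, Counts] -- made-up rows in an object-dtype array
x = np.array([[5, 'ADC01', 'Input1', 25000],
              [5, 'ADC01', 'Input2', 40000]], dtype=object)

# Element-wise AND of the two boolean masks keeps only rows matching both conditions
selected_rows = x[(x[:, 1] == 'ADC01') & (x[:, 2] == 'Input2')]
counts = selected_rows[0, 3]   # 40000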
I am quite new to pandas and I have a pandas DataFrame of about 500,000 rows filled with numbers. I am using Python 2.x and am currently defining and calling the method shown below on it. It sets a predicted value equal to the corresponding value in series 'B' if two adjacent values in series 'A' are the same. However, it runs extremely slowly, outputting only about 5 rows per second, and I want to find a way to accomplish the same result more quickly.
import numpy as np

def myModel(df):
    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size
    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan
    # Make a new empty column to store whether or not
    # the prediction matches B
    df['wrong_prediction'] = np.nan
    prev_B = B_series[0]
    for x in range(1, seriesLength):
        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # Set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            else:
                df['predicted_series'][x] = B_series[x-1]
Is there a way to vectorize this or otherwise make it run faster? Under the current circumstances, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should be giving my program that much trouble.
Something like this should work as you described:
df['predicted_series'] = np.where(A_series.shift() == A_series, B_series, df['predicted_series'])
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
This will get rid of the for loop and set predicted_series to the value of B when A is equal to previous A.
Edit:
Per your comment, change your initialization of predicted_series to be all NaN and then forward-fill the values:
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.fillna(method='ffill')
For the fastest speed, modifying ayhan's answer a bit will perform best:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())
That will give you your forward-filled values and run faster than my original recommendation.
Solution
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()
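As a quick, hedged illustration (invented data, not from the original thread), the vectorized comparison looks like this on a tiny frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3],
                   'B': [10, 20, 30, 40, 50]})

# Where A repeats its previous value, predict the previous B; otherwise leave NaN
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B.shift(), np.nan)
print(df)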
I have a dataframe that consists of multiple columns. I want to select rows based on conditions in multiple columns. Assuming that I have four columns in a dataframe:
import pandas as pd
di={"A":[1,2,3,4,5],
"B":['Tokyo','Madrid','Professor','helsinki','Tokyo Oliveira'],
"C":['250','200//250','250//250//200','12','200//300'],
"D":['Left','Right','Left','Right','Right']}
data=pd.DataFrame(di)
I want to select Tokyo in column B, values greater than 200 in column C, and Left in column D. With that, only the first row will be selected. I have to create a function to handle column C, since I need to check the first value when the row contains a list separated by //.
To handle this, I assume this can be done through the following:
def check_200(thecolumn):
    thelist = []
    for i in thecolumn:
        f = i
        if "//" in f:
            # split based on //
            z = f.split("//")
            f = z[0]
        f = float(f)
        if f > 200.00:
            thelist.append(True)
        else:
            thelist.append(False)
    return thelist
Then, I will create the multiple conditions:
selecteddata = data[(data.B.str.contains("Tokyo")) &
                    (data.D.str.contains("Left")) & (check_200(data.C))]
Is this the best way to do that, or is there an easier pandas function that can handle such requirements?
I don't think there is a most pythonic way to do this, but I think this is what you want:
bool_idx = ((data.B.str.contains("Tokyo")) &
            (data.D.str.contains("Left")) &
            (data.C.str.split("//").str[0].astype(float) > 200.00))
selecteddata = data[bool_idx]
Bruno's answer does the job, and I agree that boolean masking is the way to go. This answer keeps the code a little closer to the requested format.
import numpy as np

def col_condition(col):
    col = col.apply(lambda x: float(x.split('//')[0]) > 200)
    return col

data = data[(data.B.str.contains('Tokyo')) & (data.D.str.contains("Left")) &
            col_condition(data.C)]
The function reads in a Series, and converts each element to True or False, depending on the condition. It then returns this mask.
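As a further, fully vectorized sketch (not from the original answers, reconstructing the question's data): pandas string methods can take the part of C before any '//', cast it to float, and compare, without a helper function.

import pandas as pd

data = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                     "B": ['Tokyo', 'Madrid', 'Professor', 'helsinki', 'Tokyo Oliveira'],
                     "C": ['250', '200//250', '250//250//200', '12', '200//300'],
                     "D": ['Left', 'Right', 'Left', 'Right', 'Right']})

mask = (data.B.str.contains('Tokyo') &
        data.D.str.contains('Left') &
        (data.C.str.split('//').str[0].astype(float) > 200))
selecteddata = data[mask]   # keeps only the first row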