I have a dataframe where every row is ranked on several attributes against all the other rows. A single row can have the same rank in two attributes (meaning a row can be the best in more than one attribute), as shown in rows 2 and 3 below:
att_1 att_2 att_3 att_4
ID
984 5 3 1 46
794 1 1 99 34
6471 20 2 3 2
Per row, I want to keep the index (ID) and the column with the lowest value; in case more than one cell ties for the minimum, I have to select one of them at random to keep the distribution even.
I managed to convert the df into a numpy array and run the following:
idx = np.argmin(h_data.values, axis=1)
But it always gives me the first match when there is a tie..
Desired output:
ID MIN
984 att_3
794 att_2
6471 att_1
Thank you!
Use list comprehension with numpy.random.choice:
df['MIN'] = [np.random.choice(df.columns[x == x.min()], 1)[0] for x in df.values]
print (df)
att_1 att_2 att_3 att_4 MIN
ID
984 5 3 1 46 att_3
794 1 1 99 34 att_1
6471 20 2 3 2 att_2
If you want to do something for each row (or column), you should try the .apply method:
df.apply(np.argmin, axis=1)  # row-wise
df.apply(np.argmin, axis=0)  # column-wise
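For example, a minimal sketch on the question's data (note that this always picks the first minimum, so ties are not broken at random):
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({'att_1': [5, 1, 20], 'att_2': [3, 1, 2],
                   'att_3': [1, 99, 3], 'att_4': [46, 34, 2]},
                  index=pd.Index([984, 794, 6471], name='ID'))

# Row-wise positions of the minimum, mapped back to column labels
df['MIN'] = df.columns[df.apply(np.argmin, axis=1)]
print(df)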
Related
Imagine I have a dataframe with these variables and values:
ID  Weight  LR Weight  UR Weight  Age  LS Age  US Age  Height  LS Height  US Height
1   63      50         80         20   18      21      165     160        175
2   75      50         80         22   18      21      172     160        170
3   49      45         80         17   18      21      180     160        180
I want to create the following additional variables:
ID  Flag_Weight  Flag_Age  Flag_Height
1   1            1         1
2   1            0         0
3   1            0         1
These flags symbolize that the main variable values (e.g. Weight, Age and Height) are between the corresponding lower and upper limits. The limit columns may start with 2 different letters (in this dataframe I gave four examples: LR, UR, LS, US, but my real dataframe has more), and their limit values sometimes differ from ID to ID.
Can you help me create these flags, please?
Thank you in advance.
You can use reshaping using a temporary MultiIndex:
(df.set_index('ID')
   # split each column name into (L/U prefix, variable name) and use it as a MultiIndex
   .pipe(lambda d: d.set_axis(
       pd.MultiIndex.from_frame(d.columns.str.extract(r'(^[LU]?).*?\s*(\S+)$')),
       axis=1))
   .stack()  # index becomes (ID, variable); columns are '', 'L', 'U'
   .assign(flag=lambda d: d[''].between(d['L'], d['U']).astype(int))
   ['flag'].unstack().add_prefix('Flag_').reset_index()
)
Output:
ID Flag_Age Flag_Height Flag_Weight
0 1 1 1 1
1 2 0 0 1
2 3 0 1 1
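If the one-liner feels dense, a more explicit loop over the three variables does the same thing. This is only a sketch and assumes every bound column ends with the variable name it constrains:
out = df[['ID']].copy()
for var in ['Weight', 'Age', 'Height']:
    lo = df.filter(regex=f'^L.* {var}$').iloc[:, 0]   # e.g. 'LR Weight', 'LS Age'
    hi = df.filter(regex=f'^U.* {var}$').iloc[:, 0]   # e.g. 'UR Weight', 'US Age'
    out[f'Flag_{var}'] = df[var].between(lo, hi).astype(int)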
So, if I understood correctly, you want to add columns with these new variables. The simplest solution to this would be df.insert().
You could make it something like this:
df.insert(position at which to insert the new column, name of the new column, values of the new column)
You can build the new values in pretty much any way you can imagine: copying a column, simple arithmetic with +, -, *, /, or applying a whole function that returns the flags based on your conditions as the values of the new column.
If the new columns can just be appended, you can even just make up a new column like this:
df['new column name'] = any values you want
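For example, a rough sketch with the question's Weight columns (column names assumed from the table above):
# Insert a Flag_Weight column right after ID (position 1);
# 1 if Weight lies between its lower and upper limits, else 0.
df.insert(1, 'Flag_Weight',
          df['Weight'].between(df['LR Weight'], df['UR Weight']).astype(int))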
I hope this helped.
When I look at a df to find the null values in a dataset, here is what I get.
df.isnull().sum()
BAD 0
LOAN 0
MORTDUE 518
VALUE 112
REASON 252
JOB 279
YOJ 515
DEROG 708
DELINQ 580
CLAGE 308
NINQ 510
CLNO 222
DEBTINC 1267
dtype: int64
Next when I create a dataframe using this df, I get it as below.
df2 = pd.DataFrame(df.isnull().sum())
df2.set_index(0)
df2.index.name = None
0
BAD 0
LOAN 0
MORTDUE 518
VALUE 112
REASON 252
JOB 279
YOJ 515
DEROG 708
DELINQ 580
CLAGE 308
NINQ 510
CLNO 222
DEBTINC 1267
Why is that extra row coming in the output, and how can I remove it? I tried this on a normal test df and was able to remove the extra row using set_index(0) and df.index.name = None, but that does not work on the created dataframe df2.
As you may already know, that extra zero appearing as an "extra row" in your output is actually the header for the column name(s). When you create the DataFrame, try passing a column name if you want something more descriptive than the default "0" for column name:
df2 = pd.DataFrame(df.isnull().sum(), columns=["Null_Counts"])
Same as the difference you would get from these two variants:
print(pd.DataFrame([0,1,2,3,4,5]))
0
0 0
1 1
2 2
3 3
4 4
5 5
vs
print(pd.DataFrame([0,1,2,3,4,5], columns=["My_Column"]))
My_Column
0 0
1 1
2 2
3 3
4 4
5 5
And, if you just don't want the header row to show up in your output, which seems to be the intent of your question, then you could do something like this to just use the index values and the count values to create whatever output format you want:
df1 = pd.DataFrame([0,1,2,3,4,5], columns=["My_Column"])
for tpl in zip(df1.index.values, df1["My_Column"].values):
    print("{}\t{}".format(tpl[0], tpl[1]))
Output:
0 0
1 1
2 2
3 3
4 4
5 5
And you can also use the DataFrame function to_csv() and pass header=False if you just want to print or save the CSV output somewhere without the header row:
print(df1.to_csv(header=False))
0,0
1,1
2,2
3,3
4,4
5,5
And you can also pass sep="\t" to the to_csv function call if you prefer tab- instead of comma-delimited output.
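For example, the same frame as tab-delimited text:
print(df1.to_csv(header=False, sep="\t"))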
I have a data frame that presents some features with cumulative values. I need to identify those features in order to revert the cumulative values.
This is how my dataset looks (plus about 50 variables):
a b
346 17
76 52
459 70
680 96
679 167
246 180
What I wish to achieve is:
a b
346 17
76 35
459 18
680 26
679 71
246 13
I've seen this answer, but it first reverts the values and then tries to identify the columns. Can't I do it the other way around? First identify the features and then revert the values?
Finding cumulative features in dataframe?
What I do at the moment is run the following code to get the names of the features with cumulative values:
def accmulate_col(value):
    count = 0
    count_1 = False
    name = []
    for i in range(len(value) - 1):
        if value[i+1] - value[i] >= 0:
            count += 1
        if value[i+1] - value[i] > 0:
            count_1 = True
    name.append(1) if count == len(value) - 1 and count_1 else name.append(0)
    return name

df.apply(accmulate_col)
Afterwards, I save these feature names manually in a list called cum_features and revert the values, creating the desired dataset:
df_clean = df.copy()
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))
Is there a better way to solve my problem?
To identify which columns have increasing* values throughout the whole column, you will need to apply conditions on all the values. So in that sense, you have to use the values first to figure out what columns fit the conditions.
With that out of the way, given a dataframe such as:
import pandas as pd
d = {'a': [1,2,3,4],
'b': [4,3,2,1]
}
df = pd.DataFrame(d)
#Output:
a b
0 1 4
1 2 3
2 3 2
3 4 1
Figuring out which columns contain increasing values is just a question of using diff on all values in the dataframe, and checking which ones are increasing throughout the whole column.
That can be written as:
out = (df.diff().dropna()>0).all()
#Output:
a True
b False
dtype: bool
Then, you can just use the column names to select only those with True in them
new_df = df[df.columns[out]]
#Output:
a
0 1
1 2
2 3
3 4
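If you then want to revert the cumulative columns, as in your original snippet, a possible sketch (not part of the answer above) is:
cum_cols = df.columns[out]          # names of the increasing columns
df_clean = df.copy()
# diff() leaves NaN in the first row; fill it back with the original value
df_clean[cum_cols] = df_clean[cum_cols].diff().fillna(df_clean[cum_cols])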
*(The term cumulative doesn't really match the conditions you used. Did you want cumulative or just increasing? Cumulative implies that the value in a particular row/index is the sum of all previous values up to that index, while increasing just means the value in the current row/index is greater than the previous one.)
I'm new to Python and I'm trying to find the entries in the first column that have, in the second column, all the entries I'm searching for. E.g. I search for {155, 137} and I expect to get 5 and 6 from column id1 in return.
id1 id2
----------
1. 10
2. 10
3. 10
4. 9
5. 137
5. 150
5. 155
6. 10
6. 137
6. 155
....
I've searched a lot on Google but couldn't solve it. I read these entries from an Excel file and tried multiple for loops, but it doesn't look nice because I'm searching for a lot of entries.
I tried this:
df = pd.read_excel('path/temp.xlsx') #now I have two Columns and many rows
d1 = df.values.T[0].tolist()
d2 = df.values.T[1].tolist()
d1[d2.index(115) & d2.index(187)& d2.index(276) & d2.index(239) & d2.index(200) & d2.index(24) & d2.index(83)]
and it returned 1
I started to work this week, so I'm very new
Assuming you have two lists for the IDs (one list for id1 and one for id2), and the lists correspond to each other (that is, the value at index i in list1 corresponds to the value at index i of list2), then you simply have to find the index of the element you want to search for, and the value at the corresponding index in the other list will be the answer to your query.
To get the index of the element, you can use Python's inbuilt feature to get an index, namely:
list.index(<element>)
It will return the zero-based index of the element you wanted in the list.
To get the corresponding ID from id1, you can simply use this index (because of one-one correspondence). In your case, it can be written as:
id1[id2.index(137)] #it will return 5
NOTE:
index() method will return the index of the first matching entry from the list.
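If you need every matching position rather than only the first one, a small sketch with enumerate (assuming id1 and id2 are plain Python lists) could look like this:
matches = [i for i, v in enumerate(id2) if v == 137]
ids = [id1[i] for i in matches]   # -> [5, 6] for the sample data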
Best to use pandas:
import pandas as pd
import numpy as np
Random data:
n = 10
I = [i for i in range(1, 7)]
df1 = pd.DataFrame({'Id1': [I[np.random.randint(len(I))] for i in range(n)],
                    'Id2': np.random.randint(0, 1000, n)})
df1.head(5)
Id1 Id2
0 4 170
1 6 170
2 6 479
3 4 413
4 6 52
Query using
df1.loc[~df1.Id2.isin([170,479])].Id1
Out[344]:
3 4
4 6
5 6
6 3
7 1
8 5
9 6
Name: Id1, dtype: int64
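If the requirement is really that an id1 value must have all of the searched id2 values (as in the question), a groupby-based sketch, not shown in the answer above, might be closer:
wanted = {155, 137}
mask = df1.groupby('Id1')['Id2'].apply(lambda s: wanted.issubset(set(s)))
print(mask[mask].index.tolist())   # with the question's data this returns [5, 6]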
I have a file with 1M rows that I'm trying to read into 20 DataFrames. I do not know in advance which row belongs to which DataFrame or how large each DataFrame will be. How can I process this file into DataFrames efficiently? I've tried to do this several different ways. Here is what I currently have:
data = pd.read_csv(r'train.data', sep=" ", header = None) # Not slow
def collectData(row):
    id = row[0]
    df = dictionary[id]  # Row content determines which dataframe this row belongs to
    next = len(df.index)
    df.loc[next] = row

data.apply(collectData, axis=1)
It's very slow. What am I doing wrong? If I just apply an empty function, my code runs in 30 sec. With the actual function it takes at least 10 minutes and I'm not sure if it would finish.
Here are a few sample rows from the dataset:
1 1 4
1 2 2
1 3 10
1 4 4
The full dataset is available here (if you click on Matlab version)
Your approach is not a vectorized one, because you apply a Python function row by row.
Rather than creating 20 dataframes, make a dictionary mapping each value of column 0 to an index (in range(20)). Then add this information to your DataFrame:
data['dict']=data[0].map(dictionary)
Then reorganize:
data2=data.reset_index().set_index(['dict','index'])
data2 looks like:
            0  1   2
dict index
12   0      1  1   4
     1      1  2   2
     2      1  3  10
     3      1  4   4
     4      1  5   2
....
and data2.loc[i] is one of the DataFrames you want.
EDIT:
It seems that the dictionary is described in train.label.
You can build the dictionary beforehand with:
with open(r'train.label') as f:
    u = f.readlines()
v = [int(x) for x in u]  # len(v) = 11269 = data[0].max()
dictionary = dict(zip(range(1, len(v) + 1), v))
Since the full data set is easily loaded into memory, the following should be fairly quick:
data_split = {i: data[data[0] == i] for i in range(1, 21)}
# to access each dataframe, do a dictionary lookup, i.e.
data_split[2].head()
0 1 2
769 2 12 4
770 2 16 2
771 2 23 4
772 2 27 2
773 2 29 6
You may also want to reset the indices or copy the data frames when you're slicing the original data frame into smaller data frames.
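For instance, a variant of the same comprehension that copies and reindexes each slice (a sketch, not the only way):
data_split = {i: data[data[0] == i].copy().reset_index(drop=True)
              for i in range(1, 21)}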
additional reading:
copy
reset_index
view-vs-copy