How to set the first column as the index using iloc[:,0] - Python

I have a DataFrame and want to set its first column as the index using iloc[:,0], but something goes wrong.
I apply iloc[:,0] to select the first column and pass it to set_index:
data12 = pd.DataFrame({"b": ["a", "h", "r", "e", "a"],
                       "a": range(5)})
data2 = data12.set_index(data12.iloc[:,0])
data2
   b  a
b
a  a  0
h  h  1
r  r  2
e  e  3
a  a  4
I want to get the following result:
   a
b
a  0
h  1
r  2
e  3
a  4
Thank you very much.

Use the name of the Series, not the Series itself.
data12.set_index(data12.iloc[:, 0].name) # or data12.columns[0]
   a
b
a  0
h  1
r  2
e  3
a  4
From the documentation for set_index:
keys This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index and np.ndarray.
You need to pass a key, not an array, if you want to set the index and have the corresponding Series no longer included as a column of the DataFrame.
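For example, a minimal sketch of the difference, using the question's data12:
import pandas as pd

data12 = pd.DataFrame({"b": ["a", "h", "r", "e", "a"],
                       "a": range(5)})

# Passing a key (the column label): 'b' becomes the index and is dropped
# from the columns, which is the desired result.
by_key = data12.set_index("b")

# Passing an array (the column's values): the values become the index,
# but the 'b' column itself stays in place, hence the duplicate above.
by_array = data12.set_index(data12.iloc[:, 0])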

Related

Saving small sub-dataframes containing all values associated to a specific 'key' string

I need a little advice on a procedure using pandas. I have a 2-column dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where the first column contains strings and the second one floats. I would like to save the minimum value for each group of unique strings, so that I end up with the minimum associated with A, B, C. Does anybody have any suggestions? It would also help if I could somehow store all the values associated with each string.
Many thanks,
James
Input data:
>>> df
0 1
0 A 0.4533
1 B 0.2323
2 A 1.2343
3 A 1.2353
4 B 4.3521
5 C 3.2113
6 C 2.1233
Use groupby before min:
out = df.groupby(0).min()
Output result:
>>> out
1
0
A 0.4533
B 0.2323
C 2.1233
Update: filter out all the values in the original dataset that are more than 20% above the group minimum:
out = df[df.groupby(0)[1].apply(lambda x: x <= x.min() * 1.2)]
>>> out
0 1
0 A 0.4533
1 B 0.2323
6 C 2.1233
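A slightly different sketch of the same filter, assuming the integer column labels 0 and 1 from the example: transform broadcasts each group's minimum back onto the original rows, so the filter becomes a plain comparison.
# Per-row group minimum, aligned with df's index.
group_min = df.groupby(0)[1].transform("min")
out = df[df[1] <= group_min * 1.2]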
You can simply do it by
min_A = min(df[df["column_1"] == "A"]["value"])
min_B = min(df[df["column_1"] == "B"]["value"])
min_C = min(df[df["column_1"] == "C"]["value"])
where df is the DataFrame, and column_1 and value are the names of its two columns.
You can also do it using the built-in pandas function groupby():
>>> df.groupby(["column_1"]).min()
The above will give the same results.
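If you also want to keep all the values for each string, as mentioned in the question, one possible sketch (again assuming the integer column labels 0 and 1 and pandas >= 0.25 for named aggregation) records both the minimum and the full list per group:
# One row per string, with its minimum and the list of all its values.
summary = df.groupby(0)[1].agg(minimum="min", values=list)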

How to lookup value in another table in Python

I have two (actually many, but stick with two) datasets and I need to merge them together. However, they do not cover the same range and they have different reference values. Let's consider
a 1
b 2
c 3
e 4
and
a 2
b 3
d 7
e 2
I tried to simulate Excel's INDEX and MATCH functions, but I am not able to get the right result:
b = []
f = []
for i in data1["c1"]:
    if i in data2["c1"]:
        a = d3[data2["c4"].index[i]]
        f = b.append(a)
    else:
        continue
print(f)
Can you please help me understand how this works? I would also welcome a link with further information on this topic. Thank you.
If you want to create a consolidated file from the two above like:
Col1 Col2 Col3
a 1 2
b 2 3
c 3 7
d 4 2
You can simply use a dictionary, with your column-1 values (a, b, c, d) as keys and, as values, lists of the second-column values from your two DataFrames respectively, like:
your_dict = {"a": [1, 2], "b": [2, 3], "c": [3, 7], "d": [4, 2]}
Then, to turn that into a single DataFrame like the one above, just use the .from_dict() method in pandas with the orient parameter set to 'index' (see the pandas documentation).
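A minimal sketch of that call; the dictionary and the Col1/Col2/Col3 names simply mirror the consolidated table above:
import pandas as pd

your_dict = {"a": [1, 2], "b": [2, 3], "c": [3, 7], "d": [4, 2]}

# orient='index' turns each key into a row; the columns are then relabelled
# to match the consolidated table shown above.
combined = pd.DataFrame.from_dict(your_dict, orient="index")
combined.columns = ["Col2", "Col3"]
combined.index.name = "Col1"
print(combined.reset_index())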

How to get values on one dataframe based on the position of a value in other dataframe

I have two dataframes with the same size.
df1
1 5 3
6 5 1
2 4 9
df2
a b c
d e f
g h i
I want to get the value in df2 that is in the same position as the maximum value of each row of df1. For example, row 0 of df1 has its maximum at position [0,1], so I'd like to get element [0,1] of df2 in return.
Desired result would be:
df3
b
d
i
Thank you so much!
Don't use for loops; numpy can be handy here:
import numpy as np
vals = df2.values[np.arange(len(df2)), df1.values.argmax(1)]
Of course, you can then wrap it in a DataFrame: df3 = pd.DataFrame(vals, columns=['col'])
  col
0   b
1   d
2   i
S = df1.values.argmax(axis=1)   # column position of each row's maximum
rows = []
for p in range(len(df1)):
    rows.append(df2.iloc[p, S[p]])
df3 = pd.DataFrame(rows)
Try the code:
>>> for i, j in enumerate(df1.idxmax(axis=1)):
...     print(df2.iloc[i, j])
...
b
d
i
idxmax gives the index label of the maximum value, either column-wise (the default) or row-wise with axis=1.
Your problem has two parts:
1- Finding the maximum value of each row
2- Selecting the column of that maximum for each row
You can easily use the lookup function: the second argument finds the maximum column of each row (step one), and together with the row labels in the first argument it selects the values (step two).
df2.lookup(range(len(df1)), df1.idxmax(axis=1))  # output => array(['b', 'd', 'i'], dtype=object)
If an array does not work for you, you can also create a DataFrame from these values by simply passing them to pd.DataFrame:
pd.DataFrame(df2.lookup(range(len(df1)), df1.idxmax(axis=1)))
One good feature of this solution is that it avoids loops, which makes it efficient.
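One caveat: DataFrame.lookup was deprecated in pandas 1.2 and later removed, so on recent versions the same positional pick can be done with numpy fancy indexing. A sketch, assuming the default integer row/column labels from the example:
import numpy as np

# Position of each row's maximum in df1, then the matching values from df2.
cols = df1.to_numpy().argmax(axis=1)
vals = df2.to_numpy()[np.arange(len(df1)), cols]   # array(['b', 'd', 'i'], dtype=object)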

Mapping rows of a Pandas dataframe to numpy array

Sorry, I know there are so many questions relating to indexing, and it's probably staring me in the face, but I'm having a little trouble with this. I am familiar with the .loc, .iloc, and .index methods and slicing in general. The method .reset_index may not have been (and may not be able to be) called on our dataframe, and therefore the index labels may not be in order. The dataframe and numpy array(s) are actually different-length subsets of the dataframe, but for this example I'll keep them the same size (I can handle offsetting once I have an example).
Here is a picture that shows what I'm looking for:
I can pull columns of rows from the dataframe based on some search criteria:
idxlbls = df.index[df['timestamp'] == dt]
stuff = df.loc[idxlbls, 'col3':'col5']
But how do I map that to row numbers (array indices, not index labels) to be used as array indices in numpy (assuming the same row length)?
stuffprime = array[?, ?]
The reason I need this is that the dataframe is much larger and more complete and contains the column search criteria, but the numpy arrays are subsets that have been extracted and modified earlier in the pipeline (and do not contain the search criteria). I need to search the dataframe and pull the equivalent data from the numpy arrays. Basically, I need to correlate specific rows of the dataframe with the corresponding rows of a numpy array.
I would map pandas indices to numpy indices:
keys_dict = dict(zip(idxlbls, range(len(idxlbls))))
Then you may use the dictionary keys_dict to address the array elements by a pandas index: array[keys_dict[some_df_index], :]
I believe you need get_indexer to get positions from the filtered column names; for the index you can use the same approach, or numpy.where to get positions from a boolean mask:
df = pd.DataFrame({'timestamp': list('abadef'),
                   'B': [4,5,4,5,5,4],
                   'C': [7,8,9,4,2,3],
                   'D': [1,3,5,7,1,0],
                   'E': [5,3,6,9,2,4]}, index=list('ABCDEF'))
print (df)
timestamp B C D E
A a 4 7 1 5
B b 5 8 3 3
C a 4 9 5 6
D d 5 4 7 9
E e 5 2 1 2
F f 4 3 0 4
idxlbls = df.index[df['timestamp'] == 'a']
stuff = df.loc[idxlbls, 'C':'E']
print (stuff)
C D E
A 7 1 5
C 9 5 6
a = df.index.get_indexer(stuff.index)
Or get positions by boolean mask:
a = np.where(df['timestamp'] == 'a')[0]
print (a)
[0 2]
b = df.columns.get_indexer(stuff.columns)
print (b)
[2 3 4]
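With the row positions in a and the column positions in b, the matching block can then be pulled out of any numpy array aligned with df. A small sketch; arr here is just a hypothetical stand-in for the array produced earlier in the pipeline:
import numpy as np

# Hypothetical array with the same shape as df.
arr = np.arange(df.shape[0] * df.shape[1]).reshape(df.shape)
block = arr[np.ix_(a, b)]   # rows [0, 2], columns [2, 3, 4]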

Create value if missing for this identifier

I want to solve a problem that essentially boils down to this:
I have identifier numbers (thousands of them) and each should be uniquely linked to a set of letters. Let's call them a through e. These can be filled from another column (y) if that helps.
Occasionally one of the letters is missing and is registered as NaN. How can I replace the missing values so that each identifier ends up with all the required letters?
Idnumber X y
1 a a
2 a a
1 b b
1 NaN d
2 b NaN
1 d c
2 c NaN
1 NaN e
2 d d
2 e e
Any given x can be missing.
The dataset is too big to simply add all possibilities and drop duplicates.
The idea is to get:
Idnumber X
1 a
2 a
1 b
1 c
2 b
1 d
2 c
1 e
2 d
2 e
The main issue is getting a unique solution, i.e. making sure that I replace one NaN with c and the other with e.
Is this what you're looking for? Or does this use too much RAM? If it does, you can use the chunksize parameter in read_csv: write the results (with duplicates and NaNs dropped) for each individual chunk to csv, then load those and drop duplicates again, this time just dropping the duplicates that conflict across chunks.
# Loading the DataFrame
import pandas as pd
from io import StringIO   # Python 3; the original answer used the Python 2 StringIO module
x=StringIO('''Idnumber,X,y
1,a,a
2,a,a
1,b,b
1,NaN,d
2,b,NaN
1,d,c
2,c,NaN
1,NaN,e
2,d,d
2,e,e''')
# Operations on the DataFrame
df = pd.read_csv(x)
df1 = df[['Idnumber', 'X']]
df2 = df[['Idnumber', 'y']].rename(columns={'y': 'X'})   # avoids the chained-assignment warning from an inplace rename on a slice
pd.concat([df1, df2]).dropna().drop_duplicates()
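And a rough sketch of the chunked variant described above, for data too large to process in one go (the file names and chunk size are illustrative, and the columns are assumed to match the example):
import pandas as pd

pieces = []
for chunk in pd.read_csv("big_input.csv", chunksize=100000):
    part = pd.concat([chunk[["Idnumber", "X"]],
                      chunk[["Idnumber", "y"]].rename(columns={"y": "X"})])
    pieces.append(part.dropna().drop_duplicates())

# Duplicates can still conflict across chunks, so drop them once more at the end.
result = pd.concat(pieces).drop_duplicates()
result.to_csv("deduplicated.csv", index=False)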
