using set() with pandas - python

Can set() be used to read the data in a specific column in pandas? For example, I have the following output from a DataFrame df1:
df1= [
0 -10 2 5
1 24 5 10
2 30 3 6
3 30 2 1
4 30 4 5
]
where the first column is the index. I first tried to isolate the second column
[-10
24
30
30
30]
using the following: x = pd.DataFrame(df1, columns=[0]) Then I transposed the column with XX = x.T and applied the set() function.
However, instead of obtaining [-10 24 30], I got [0 1 2 3 4].
So set() read the index instead of the first column.

set() takes an iterable.
Using a pandas dataframe as an iterable yields the column names in turn.
Since you've transposed the dataframe, your index values are now column names, so when you use the transposed dataframe as an iterable you get those index values.
If you want to use set() to get the values in the column, you can use:
x = pd.DataFrame(df1, columns=[0])
set(x.iloc[:,0].values)
But if you just want the unique values in column 0, select the column as a Series (a DataFrame has no unique() method) and use
df1[0].unique()
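The difference between iterating a DataFrame and iterating one of its columns can be seen in a small sketch. It assumes df1 has the default integer column labels 0, 1, 2, which the question does not state explicitly.

```python
import pandas as pd

# Rebuild the frame from the question; integer column labels are an assumption.
df1 = pd.DataFrame([[-10, 2, 5], [24, 5, 10], [30, 3, 6], [30, 2, 1], [30, 4, 5]])

# Iterating a DataFrame yields its column labels, so set() collects labels.
labels_plain = set(df1)
# After transposing, the old row index becomes the column labels.
labels_transposed = set(df1[[0]].T)

# Selecting the column as a Series gives access to the values instead.
values_as_set = set(df1[0])
unique_values = df1[0].unique()  # NumPy array, in order of first appearance
```

Here labels_transposed reproduces the [0 1 2 3 4] result from the question, while values_as_set and unique_values hold -10, 24 and 30.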

Related

Columns getting appended to wrong row in pandas

So I have a dataframe like this:-
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Now, I want to append some columns in between those "Something" column names, for which I have used this code:-
j = 1
for i in range(2, 51):
    if i % 2 != 0 and i != 4:
        df.insert(i, f"% Difference {j}", " ")
        j += 1
where df is the dataframe. Now what happens is that the columns do get inserted but like this:-
0 1 Difference 1 2 ...
0 Index Something NaN Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
whereas what I wanted was this:-
0 1 2 3 ...
0 Index Something Difference 1 Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
Edit 1 Using jezrael's logic:-
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop = True)
print(df)
The output of that is still this:-
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Any ideas or suggestions as to where or how I am going wrong?
If your dataframe looks like what you've shown in your first code block, your column names aren't Index, Something, etc. - they're actually 0, 1, etc.
Pandas is seeing Index, Something, etc. as data in row 0, NOT as column names (which exist above row 0). So when you add a column with the name Difference 1, you're adding a column above row 0, which is where the range of integers is located.
A couple potential solutions to this:
If you'd like the actual column names to be Index, Something, etc. then the best solution is to import the data with that row as the headers. What is the source of your data? If it's a csv, make sure to NOT use the header = None option. If it's from somewhere else, there is likely an option to pass in a list of the column names to use. I can't think of any reason why you'd want to have a range of integer values as your column names rather than the more descriptive names that you have listed.
Alternatively, you can do what @jezrael suggested and convert your first row of data to column names then delete that data row. I'm not sure why their solution isn't working for you, since the code seems to work fine in my testing. Here's what it's doing:
df.columns = df.iloc[0].tolist()
df.columns tells pandas what to (re)name the columns of the dataframe. df.iloc[0].tolist() creates a list out of the first row of data, which in your case is the column names that you actually want.
df = df.iloc[1:].reset_index(drop = True)
This grabs the 2nd through last rows of data to recreate the dataframe. So you have new column names based on the first row, then you recreate the dataframe starting at the second row. The .reset_index(drop = True) isn't totally necessary to include. That just restarts your actual data rows with an index value of 0 rather than 1.
If for some reason you want to keep the column names as they currently exist (as integers rather than labels), you could do something like the following under the if statement in your for loop:
df.insert(i, i, np.nan, allow_duplicates = True)
df.iat[0, i] = f"%Difference {j}"
df.columns = np.arange(len(df.columns))
The first line inserts a column with an integer label, filled with NaN values to start with (assuming you have numpy imported as np). You need to allow duplicates, otherwise you'll get an error since the integer value will already be the name of a pre-existing column.
The second line changes the value in the 1st row of the newly-created column to what you want.
The third line resets the column names to be a range of integers like you had to start with.
As @jezrael suggested, it seems like you might be a little unclear about the difference between column names, indices, and data rows and columns. An index is its own thing, so it's not usually necessary to have a column named Index like you have in your dataframe, especially since that column has the same values in it as the actual index. Clarifying those sorts of things at import can help prevent a lot of hassle later on, so I'd recommend taking a good look at your data source to see if you can create a clearer dataframe to start with!
I want to append some columns in between those "Something" column names
No, there are no columns named Something. To get them, you first need to set the first row of data as the column names:
print (df.columns)
Int64Index([0, 1, 2], dtype='int64')
print (df.iloc[0].tolist())
['Index', 'Something', 'Something2']
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print (df)
Index Something Something2
0 1 5 8
1 2 6 9
2 3 7 10
print (df.columns)
Index(['Index', 'Something', 'Something2'], dtype='object')
Then your solution creates the Difference columns, but the output is different from what you showed: there are no columns 0, 1, 2, 3.
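Putting the two steps together, a minimal sketch of promoting the first row to column names and then inserting a column by position might look like this. The small frame is a hypothetical stand-in for the one in the question, and only one placeholder column is inserted rather than the full loop.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the question: the real headers sit in row 0.
df = pd.DataFrame([
    ["Index", "Something", "Something2"],
    [1, 5, 8],
    [2, 6, 9],
    [3, 7, 10],
])

# Promote row 0 to column names, then drop it from the data.
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)

# Insert a placeholder column at position 2, between the two "Something" columns.
df.insert(2, "% Difference 1", np.nan)
```

After this, the new label sits in the header alongside Index, Something and Something2 instead of above a row of integer labels.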

Why do we need to add : when defining a new column using .iloc function

When we make a new column in a pandas DataFrame like this:
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only taking the columns at positions 5 to 7, why do we also need to pass : before the column range?
pandas.DataFrame.iloc is an indexer (used with square brackets, not called like a function) for purely integer-location based selection by position; see the pandas documentation for details. The first entry selects rows and the second selects columns, so the : means all rows of the selected columns, here the columns at positions 5 and 6 (an iloc slice excludes its end position).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of that slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code would get you a new column containing sums across columns
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
This way you explicitly tell pandas to apply the sum across the columns of each row. However, your syntax is more succinct and is preferable to explicit function application. The result is the same, as one would expect.
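To see why the leading : matters, it may help to compare a single slice with a row-and-column pair on the same example frame. The column name total is only an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [2, 3, 4], "c": [4, 5, 6]})

# A single slice is interpreted as a ROW selection: rows 1 and 2, all columns.
rows_only = df.iloc[1:3]

# With two entries, ":" keeps every row and 1:3 picks the COLUMNS b and c.
cols_only = df.iloc[:, 1:3]

# Row-wise sum over the selected columns, stored as a new column.
df["total"] = df.iloc[:, 1:3].sum(axis=1)
```

Without the :, the slice 1:3 would therefore pick rows rather than columns, which is why both entries are needed.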

Iterate through given dataframe

df =
0 20
1 19
2 18
3 17
4 16
I am iterating with a loop:
for k in df:
    af = AffinityPropagation(preference=k).fit(X)
    labels = af.labels_
    score = silhouette_score(frechet, labels)
    print("Preference: {0}, Silhouette score: {1}".format(k,score))
I get only 1 number. But I want to get a dataframe with as many numbers as there are rows in df, i.e. len(df).
You need to use iterrows as @CodeDifferently points out in his comment above.
Here is an example:
Where df is:
df = pd.DataFrame({0:range(20,0,-1)})
Then using your method:
for k in df:
    print(k)
Output:
0
This zero is the column header of the dataframe. You are iterating through the dataframe's column names.
Using iterrows:
for _,k in df.iterrows():
    print(k.iloc[0])
Output:
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
Here you get each row of the dataframe as a Series, and iloc[0] picks out the first (and in this case only) value of that row.
You almost never need to iterate over a DataFrame. Columns are basically NumPy arrays and have array-like 'elementwise' superpowers. (You ~never need to iterate over NumPy arrays either.)
Maybe formulate your task as a function and use the apply() method on the DataFrame or Series. This 'applies' a function to every item in a column without the need for a loop.
But if you really only have one column like this, why use a DataFrame at all? Just use a NumPy array (or get at it with the column's values attribute).
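If the goal is one score per preference value, one option is to iterate over the column itself and collect the results into a new DataFrame. In this sketch score_for is a hypothetical stand-in for the AffinityPropagation and silhouette_score step from the question, and the column names preference and score are likewise assumptions.

```python
import pandas as pd

df = pd.DataFrame({0: range(20, 15, -1)})


def score_for(preference):
    # Hypothetical placeholder: the real code would fit the clustering
    # model with this preference and return its silhouette score.
    return preference / 100


# Iterate over the values of column 0, not over the DataFrame itself
# (which would only yield the column label 0).
results = pd.DataFrame(
    [{"preference": k, "score": score_for(k)} for k in df[0]]
)
```

The resulting frame has one row per value in df, so its length matches len(df).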

Multiply columns with rows by matching column name and row name in Python / Pandas

I have a data frame which looks like this
> data
A B
1 1 2
2 2 1
I have a reference data frame which looks like this
> ref
Names Values
1 A 5
2 B 10
I want to multiply each column of data by the value in the row of ref that has the same name.
the result should be this
> result
A B
1 5 20
2 10 10
What is the fastest way to achieve this in Python? Any help would be greatly appreciated
You may want to check mul
df.mul(ref.set_index('Names').Values)
Out[137]:
A B
1 5 20
2 10 10
Your reference dataframe ref can be represented as a Series as follows or with ref.set_index('Names')['Values']
s = pd.Series([5, 10], index=['A', 'B'])
Your data dataframe is as follows:
df = pd.DataFrame(dict(A=[1,2], B=[2,1]))
Multiplying the two with df * s produces the desired output because the indexing of each object is used to determine which arrays get multiplied together.
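A short sketch of that alignment, using the data and ref frames from the question. The rows of ref are deliberately listed out of order here (an assumption made for illustration) to show that the match is by label and not by position.

```python
import pandas as pd

data = pd.DataFrame({"A": [1, 2], "B": [2, 1]}, index=[1, 2])
# Rows of ref deliberately out of order to show label-based matching.
ref = pd.DataFrame({"Names": ["B", "A"], "Values": [10, 5]})

# Turn ref into a Series indexed by name: B -> 10, A -> 5.
factors = ref.set_index("Names")["Values"]

# mul aligns the Series index with the DataFrame's column labels by default.
result = data.mul(factors)
```

Column A is multiplied by 5 and column B by 10, regardless of the row order in ref.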

Pandas: Get top 10 values AFTER grouping

I have a pandas data frame with a column 'id' and a column 'value'. It is already sorted by first id (ascending) and then value (descending). What I need is the top 10 values per id.
I assumed that something like the following would work, but it doesn't:
df.groupby("id", as_index=False).aggregate(lambda (index,rows) : rows.iloc[:10])
What I get is just a list of ids, the value column (and other columns that I omitted for the question) aren't there anymore.
Any ideas how it might be done, without iterating through each of the single rows and appending the first ten to another data structure?
Is this what you're looking for?
df.groupby('id').head(10)
I would like to answer this by giving an example dataframe:
df = pd.DataFrame(np.array([['a','a','b','c','a','c','b'],[4,6,1,8,9,4,1],[12,11,7,1,5,5,7],[123,54,146,96,10,114,200]]).T,columns=['item','date','hour','value'])
df['value'] = pd.to_numeric(df['value'])
This gives you a dataframe
item date hour value
a 4 12 123
a 6 11 54
b 1 7 146
c 8 1 96
a 9 5 10
c 4 5 114
b 1 7 200
Grouping by item and calling head(2) then returns the first 2 values of each group:
df.groupby(['item'])['value'].head(2)
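Since head(n) simply takes the first n rows of each group in their current order, the frame has to be sorted first if "top" is meant by value. A minimal sketch with made-up data, using the id and value column names from the question and n = 2 instead of 10:

```python
import pandas as pd

# Made-up, unsorted data for illustration.
df = pd.DataFrame({
    "id": [2, 1, 1, 2, 1],
    "value": [2, 7, 9, 8, 3],
})

# Sort by id ascending and value descending, then keep the first 2 rows per id.
top2 = (
    df.sort_values(["id", "value"], ascending=[True, False])
      .groupby("id")
      .head(2)
)
```

The result keeps all columns and the original row labels, so it can be used directly without rebuilding the frame.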
