Python: Changing values in a DataFrame - python

I'm new to python and pandas and I need some ideas. Say I have the following DataFrame:
0 1 2 3 4 5
1 5 5 5 5 5
2 5 5 5 5 5
3 5 5 5 5 5
4 5 5 5 5 5
I want to iterate through each row and change the values of specific columns. Say I wanted to change all of the values in columns (2,3,4) to a 3.
This is what I've tried, am I going down the right path?
for row in df.iterrows():
for col in range(2, 4):
df.set_value('row', 'col', 3)
EDIT:
Thanks for the responses. The simple solutions are obvious, but what if I wanted to change the values to this... for example:
0 1 2 3 4 5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
4 16 17 18 19 20

If you are using a loop when working with dataframes, you are almost always not on the right track.
For this you can use a vectorized assignment:
df[[2, 3, 4]] = 3
Example:
df = pd.DataFrame({1: [1, 2], 2: [1, 2]})
print(df)
# 1 2
# 0 1 1
# 1 2 2
df[[1, 2]] = 3
print(df)
# 1 2
# 0 3 3
# 1 3 3

you can do this
df.iloc[:,1] = 3 #columns 2
df.iloc[:,2] = 3
df.iloc[:,3] = 3

Related

Adding a dataframe to another after a loop

I have a portion of my dataframe here:
days = [1, 2, 3, 4, 5]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.Series(time)
df2 = df2.transpose()
df3 = df1*df2
df4 = df1.dot(df2.to_frame().T)
df4 =
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
3 8 16 8 16 8 16 8 16 8
4 10 20 10 20 10 20 10 20 10
I have an if loop that creates a single row dataframe which looks like:
df_new =
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
I need to be able to loop through and add this row to the end of the larger dataframe a handful of times so the end result looks like this:
df_final =
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
3 8 16 8 16 8 16 8 16 8
4 10 20 10 20 10 20 10 20 10
5 2 4 2 4 2 4 2 4 2
6 5 6 7 8 9 8 7 6 5
I have tried to either append or concact the new dataframe to the existing one, but I receive errors both ways. Either indexing errors or a few looping issues. I think I need either a better understanding of why the row cannot be added to the end of the dataframe or an idea of a work around. The loop has 25 iterations where I only added two, but the idea is the same, I will get a new row in the form of a single row dataframe and I need to add that data from the single row dataframe without the column headers to the final dataframe. I am willing to update my question as soon as I get a better idea of how this can work, it does not seem like a difficult task, but I am sure I am asking the wrong thing.

Grid-like dataframe to list

I have an excel dataset which contains 100 rows and 100 clolumns with order frequencies in locations described by x and y.(gird like structure)
I'd like to convert it to the following structure with 3 columns:
x-Coördinaten | y-Coördinaten | value
The "value" column only contains positive integers. The x and y column contain float type data (geograohical coordinates.
The order does not matter, as it can easily be sorted afterwards.
So,basicly a merge of lists could work, e.g.:
[[1,5,3,5], [4,2,5,6], [2,3,1,5]] ==> [1,5,3,5,4,2,5,6,2,3,1,5]
But then i would lose the location...which is key for my project.
What is the best way to accomplish this?
Assuming this input:
l = [[1,5,3,5],[4,2,5,6],[2,3,1,5]]
df = pd.DataFrame(l)
you can use stack:
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
output:
x y value
0 0 0 1
1 0 1 5
2 0 2 3
3 0 3 5
4 1 0 4
5 1 1 2
6 1 2 5
7 1 3 6
8 2 0 2
9 2 1 3
10 2 2 1
11 2 3 5
or melt for a different order:
df2 = df.rename_axis('x').reset_index().melt('x', var_name='y', value_name='value')
output:
x y value
0 0 0 1
1 1 0 4
2 2 0 2
3 0 1 5
4 1 1 2
5 2 1 3
6 0 2 3
7 1 2 5
8 2 2 1
9 0 3 5
10 1 3 6
11 2 3 5
You should be able to get the results with a melt operation -
df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.columns = [2, 3, 4]
df.loc[:, 'x'] = [3, 4, 5]
This is what df looks like
2 3 4 x
0 0 1 2 3
1 3 4 5 4
2 6 7 8 5
The melt operation -
df.melt(id_vars='x', var_name='y')
output -
x y value
0 3 2 0
1 4 2 3
2 5 2 6
3 3 3 1
4 4 3 4
5 5 3 7
6 3 4 2
7 4 4 5
8 5 4 8

How to make the data greater than or equal to the previous data

I have a dataframe, and I want every value in a specific column to be greater or equal to the previous value.
In the example below, I want it to be applied on column 'd', so the expected result should replace 5 with 7.
So how can I do it?
The actual situation needs to consider efficiency issues, it is best not to use FOR.
Thank you in advance for your reply.
code:
>>> df = pd.DataFrame({'a': [1,3,2,5], 'b': [2, 4,3,4], 'c':[3,5,4,3]})
>>> df
a b c
0 1 2 3
1 3 4 5
2 2 3 4
3 5 4 3
>>> df['d'] = df['a']+df['b']
>>> df
a b c d
0 1 2 3 3
1 3 4 5 7
2 2 3 4 5
3 5 4 3 9
You can do this using cummax:
>>> df['d'] = df['d'].cummax()
>>> df
a b c d
0 1 2 3 3
1 3 4 5 7
2 2 3 4 7
3 5 4 3 9

Pandas how to output distinct values in column based on duplicate in another column

Here an example:
import pandas as pd
df = pd.DataFrame({
'product':['1','1','1','2','2','2','3','3','3','4','4','4','5','5','5'],
'value':['a','a','a','a','a','b','a','b','a','b','b','b','a','a','a']
})
product value
0 1 a
1 1 a
2 1 a
3 2 a
4 2 a
5 2 b
6 3 a
7 3 b
8 3 a
9 4 b
10 4 b
11 4 b
12 5 a
13 5 a
14 5 a
I need to output:
1 a
4 b
5 a
Because 'value' values for distinct 'product' values all are same
I'm sorry for bad English
I think you need this
m=df.groupby('product')['value'].transform('nunique')
df.loc[m==1].drop_duplicates(). reset_index(drop=True)
Output
product value
0 1 a
1 4 b
2 5 a
Details
df.groupby('product')['value'].transform('nunique') returns a series as below
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 2
9 1
10 1
11 1
12 1
13 1
14 1
where the numbers of the number of unique values in each group. Then we use df.loc to get only the rows in which this value is 1, so, the groups with unique values.
The we drop duplicates since you need only the group & its unique value.
If I undestand correctly your question, this simple code is for your:
distinct_prod_df = df.drop_duplicates(['product'])
and gives:
product value
0 1 a
3 2 a
6 3 a
9 4 b
12 5 a
You can try this:
mask = df.groupby('product').apply(lambda x: x.nunique() == 1)
df = df[mask].drop_duplicates()

Adding a Dataframe to a multindex dataframe

I'm trying to create a historical time-series of a number of identifiers for a number of different metrics, as part of that i'm trying to create multi index dataframe and then "fill it" with the individual dataframes.
Multi Index:
ID1 ID2
ITEM1 ITEM2 ITEM1 ITEM2
index
Dataframe to insert
ITEM1 ITEM2
Date
a
b
c
looking through the official docs and this website i found the following relevant:
Add single index data frame to multi index data frame, Pandas, Python and the associated pandas official docs pages:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
https://pandas.pydata.org/pandas-docs/stable/advanced.html
i've managed with something like :
for i in df1.index:
for j in df2.columns:
df1.loc[i,(ID,j)]=df2.loc[i,j]
but it seems highly inefficient when i need to do this across circa 100 dataframes.
for some reason a simply
df1.loc[i,(ID)]=df2.loc[i] doesn't seem to work
neither does :
df1[ID1]=df1.append(df2)
which returns a Cannot set a frame with no defined index and a value that cannot be converted to a Series
my understanding from looking around is that this is because im effectively leaving half the dataframe empty ( ragged list? )
any help appreciated on how to iteratively populate my multi index DF would be greatly appreciated.
let me know if i've missed relevant information,
cheers.
Setup
df1 = pd.DataFrame(
[[1, 2, 3, 4, 5, 6] * 2] * 3,
columns=pd.MultiIndex.from_product(['ID1 ID2 ID3'.split(), range(4)])
)
df2 = df1.ID1 * 2
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 1 2 3 4 5 6
1 1 2 3 4 5 6 1 2 3 4 5 6
2 1 2 3 4 5 6 1 2 3 4 5 6
df2
0 1 2 3
0 2 4 6 8
1 2 4 6 8
2 2 4 6 8
The problem is that Pandas is trying to line up indices (or columns in this case). We can do some transpose/join trickery but I'd rather avoid that.
Option 1
Take advantage of the fact that we can assign via loc an array so long as the shape matches up. Well, we better make sure it does and that the order of columns and index are correct. I use align with the right parameter to do this. Then assign the values of the aligned df2
df1.loc[:, 'ID1'] = df2.align(df1.ID1, 'right')[0].values
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
Option 2
Or, we can give df2 the additional level of column indexing that we need to lined it up. The use update to replace the relevant cells in place.
df1.update(pd.concat({'ID1': df2}, axis=1))
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
Option 3
A creative way using stack and assign with unstack
df1.stack().assign(ID1=df2.stack()).unstack()
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6

Categories