Grid-like dataframe to list - python

I have an Excel dataset which contains 100 rows and 100 columns with order frequencies in locations described by x and y (a grid-like structure).
I'd like to convert it to the following structure with 3 columns:
x-coordinates | y-coordinates | value
The "value" column only contains positive integers. The x and y columns contain float data (geographical coordinates).
The order does not matter, as it can easily be sorted afterwards.
So, basically, a merge of lists could work, e.g.:
[[1,5,3,5], [4,2,5,6], [2,3,1,5]] ==> [1,5,3,5,4,2,5,6,2,3,1,5]
But then I would lose the location, which is key for my project.
What is the best way to accomplish this?

Assuming this input:
import pandas as pd
l = [[1,5,3,5],[4,2,5,6],[2,3,1,5]]
df = pd.DataFrame(l)
you can use stack:
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
output:
x y value
0 0 0 1
1 0 1 5
2 0 2 3
3 0 3 5
4 1 0 4
5 1 1 2
6 1 2 5
7 1 3 6
8 2 0 2
9 2 1 3
10 2 2 1
11 2 3 5
or melt for a different order:
df2 = df.rename_axis('x').reset_index().melt('x', var_name='y', value_name='value')
output:
x y value
0 0 0 1
1 1 0 4
2 2 0 2
3 0 1 5
4 1 1 2
5 2 1 3
6 0 2 3
7 1 2 5
8 2 2 1
9 0 3 5
10 1 3 6
11 2 3 5
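The examples above use the default integer labels as x and y. If the grid's row and column labels are the actual coordinates, the same stack approach should carry them through. A minimal sketch, assuming the coordinates are available as the index and the column headers (the values in x_coords and y_coords are made up for illustration):

```python
import pandas as pd

# Hypothetical coordinate labels; in the real data these would come from the
# first column and the header row of the Excel sheet.
x_coords = [51.10, 51.20, 51.30]
y_coords = [4.10, 4.20, 4.30, 4.40]

grid = pd.DataFrame(
    [[1, 5, 3, 5], [4, 2, 5, 6], [2, 3, 1, 5]],
    index=x_coords,
    columns=y_coords,
)

# Name both axes, stack the columns into the index, then flatten the
# resulting (x, y) MultiIndex into ordinary columns.
long_df = (
    grid.rename_axis(index="x", columns="y")
    .stack()
    .reset_index(name="value")
)
print(long_df)
```

Each row of long_df then holds one grid cell together with its x and y coordinate, so the location is not lost.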

You should be able to get the results with a melt operation -
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.columns = [2, 3, 4]
df.loc[:, 'x'] = [3, 4, 5]
This is what df looks like
2 3 4 x
0 0 1 2 3
1 3 4 5 4
2 6 7 8 5
The melt operation -
df.melt(id_vars='x', var_name='y')
output -
x y value
0 3 2 0
1 4 2 3
2 5 2 6
3 3 3 1
4 4 3 4
5 5 3 7
6 3 4 2
7 4 4 5
8 5 4 8

Related

Counting consecutive duplicate elements in a dataframe and storing them in a new column

I am trying to count the consecutive elements in a data frame and store them in a new column. I don't want to count the total number of times an element appears in the list, but how many times it appears consecutively. I used this:
a=[1,1,3,3,3,5,6,3,3,0,0,0,2,2,2,0]
df = pd.DataFrame(list(zip(a)), columns =['Patch'])
df['count'] = df.groupby('Patch').Patch.transform('size')
print(df)
This gave me a result like this:
Patch count
0 1 2
1 1 2
2 3 5
3 3 5
4 3 5
5 5 1
6 6 1
7 3 5
8 3 5
9 0 4
10 0 4
11 0 4
12 2 3
13 2 3
14 2 3
15 0 4
However, I want the result to be like this:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
df = (
df.groupby((df.Patch != df.Patch.shift(1)).cumsum())
.agg({"Patch": ("first", "count")})
.reset_index(drop=True)
.droplevel(level=0, axis=1)
.rename(columns={"first": "Patch"})
)
print(df)
Prints:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
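The key step in the answer above is the grouping key: comparing each value with the previous one marks where a new run starts, and the cumulative sum of those marks gives every consecutive run its own id. A minimal sketch that spells out the intermediate steps (the names starts_new_run, run_id and runs are only illustrative):

```python
import pandas as pd

a = [1, 1, 3, 3, 3, 5, 6, 3, 3, 0, 0, 0, 2, 2, 2, 0]
patch = pd.Series(a, name="Patch")

# True wherever the value differs from the previous row. The first row is
# compared with the NaN produced by shift, so it always starts a run.
starts_new_run = patch != patch.shift(1)

# The cumulative sum of the booleans gives each consecutive run its own id.
run_id = starts_new_run.cumsum().rename("run_id")

# One row per run: the value of the run and its length.
runs = (
    patch.groupby(run_id)
    .agg(Patch="first", count="size")
    .reset_index(drop=True)
)
print(runs)
```

Because the grouping is by run id rather than by value, the two separate runs of 3 and of 0 are counted separately.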

Python dataframe add columns in groups of 3

I have a data-frame with n rows:
df = 1 2 3
4 5 6
4 2 3
3 1 9
6 7 0
9 2 5
I want to add a column that holds the same value for each group of 3 rows.
n (the number of rows) is guaranteed to be divisible by 3.
So the new df will be:
df = 1 2 3 A
4 5 6 A
4 2 3 A
3 1 9 B
6 7 0 B
9 2 5 B
What is the best way to do so?
First remove the trailing rows with DataFrame.iloc if the row count is not divisible by 3, then create the group labels by integer-dividing the row positions by 3:
print (df)
a b d
0 1 2 3
1 4 5 6
2 4 2 3
3 3 1 9
4 6 7 0
5 9 2 5
6 0 0 4 <- removed last row
N = 3
num = len(df) // N * N
df = df.iloc[:num].copy()  # copy so the column assignment below does not target a slice
df['groups'] = np.arange(len(df)) // N
print (df)
a b d groups
0 1 2 3 0
1 4 5 6 0
2 4 2 3 0
3 3 1 9 1
4 6 7 0 1
5 9 2 5 1
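The question asks for letter labels (A, B, ...) rather than integers. The integer group numbers can be mapped to letters by indexing into the alphabet. A minimal sketch, assuming there are at most 26 groups (more groups would need a different labelling scheme):

```python
import string

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [4, 2, 3], [3, 1, 9], [6, 7, 0], [9, 2, 5]],
    columns=["a", "b", "d"],
)

N = 3
# 0, 0, 0, 1, 1, 1, ... : one integer per block of N rows.
group_numbers = np.arange(len(df)) // N

# Assumption: no more than 26 groups, so a single uppercase letter suffices.
letters = np.array(list(string.ascii_uppercase))
df["groups"] = letters[group_numbers]
print(df)
```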
IIUC, groupby:
df['new_col'] = df.sum(1).groupby(np.arange(len(df))//3).transform('sum')
Output:
0 1 2 new_col
0 1 2 3 30
1 4 5 6 30
2 4 2 3 30
3 3 1 9 42
4 6 7 0 42
5 9 2 5 42

Replace all column values with value from single column - Pandas

I'm hoping to replace values in all columns within a df using integers from a specified column. Using the df below I want to use the values in Code and replace them in all other columns.
df = pd.DataFrame({
'Place' : ['X','Y','X','Y','X','Y','X','Y'],
'Number' : ['A','B','C','D','F','G','H','I'],
'Code' : [1,2,3,0,1,2,5,4],
'Value' : ['','','','','','','','']
})
df[:] = df['Code'].apply(lambda x: x if np.isreal(x) else 0).astype(int)
print(df)
Intended Output:
Place Number Code Value
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 0 0 0 0
4 1 1 1 1
5 2 2 2 2
6 5 5 5 5
7 4 4 4 4
Use reindex, ffill, bfill
df[['Code']].reindex(columns=df.columns).ffill(axis=1).bfill(axis=1).astype(int)
Out[256]:
Place Number Code Value
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 0 0 0 0
4 1 1 1 1
5 2 2 2 2
6 5 5 5 5
7 4 4 4 4
Numpy solution
df[:] = np.transpose([df.Code] * df.shape[1])
Out[314]:
Place Number Code Value
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 0 0 0 0
4 1 1 1 1
5 2 2 2 2
6 5 5 5 5
7 4 4 4 4
Try this:
df[df.columns] = df[['Code', 'Code', 'Code', 'Code']]
or:
df[df.columns] = df[['Code']*len(df.columns)]
Hope it helps you.

Pandas dataframe - running sum within cluster

I have
x cluster_id
0 1 1
1 3 1
2 2 2
3 5 2
4 4 3
I want to generate
x cluster_id s
0 1 1 1
1 3 1 4
2 2 2 2
3 5 2 7
4 4 3 4
i.e. s is the running sum of x, but it gets reset when the cluster id changes. How is this achieved?
Alternatively, if it is easier, it may be Ok to do
x cluster_id s
0 1 1 4
1 3 1 4
2 2 2 7
3 5 2 7
4 4 3 4
i.e. all values for s within the same cluster are the same, and correspond to the total sum in the cluster.
Additionally, I want to subsample this so that I keep the last row of each cluster:
x cluster_id s
1 3 1 4
3 5 2 7
4 4 3 4
(note that all the cluster ids are different). How can I do this?
You can get the running totals using .cumsum() with .groupby()
>>> df
x cluster_id
0 1 1
1 3 1
2 2 2
3 5 2
4 4 3
>>> df['s'] = df.groupby('cluster_id')['x'].cumsum()
>>> df
x cluster_id s
0 1 1 1
1 3 1 4
2 2 2 2
3 5 2 7
4 4 3 4
Then to get only the last row for each cluster_id:
>>> df.groupby('cluster_id').last().reset_index()
cluster_id x s
0 1 3 4
1 2 5 7
2 3 4 4
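For the alternative layout in the question, where every row carries the total of its cluster, transform('sum') can be used instead of cumsum. A minimal sketch, which also shows drop_duplicates as one way to keep the last row per cluster while preserving the original index (unlike groupby(...).last(), which builds a new index):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 3, 2, 5, 4], "cluster_id": [1, 1, 2, 2, 3]})

# Total of x per cluster, broadcast back to every row of that cluster.
df["s"] = df.groupby("cluster_id")["x"].transform("sum")

# Keep only the last occurrence of each cluster_id; the original row
# labels (1, 3, 4) are preserved.
last_rows = df.drop_duplicates("cluster_id", keep="last")
print(last_rows)
```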

Python: Replace a cell value in Dataframe with if statement

I have a matrix that looks like this:
com 0 1 2 3 4 5
AAA 0 5 0 4 2 1 4
ABC 0 9 8 9 1 0 3
ADE 1 4 3 5 1 0 1
BCD 1 6 7 8 3 4 1
BCF 2 3 4 2 1 3 0 ...
Where AAA, ABC ... is the dataframe index. The dataframe columns are com 0 1 2 3 4 5.
I want to set a cell to 0 when the row's value of com equals the column label. For instance, the above matrix would become:
com 0 1 2 3 4 5
AAA 0 0 0 4 2 1 4
ABC 0 0 8 9 1 0 3
ADE 1 4 0 5 1 0 1
BCD 1 6 0 8 3 4 1
BCF 2 3 4 0 1 3 0 ...
I tried to iterate over the rows and use both .loc and .ix, but had no success.
This just requires a NumPy trick:
In [22]:
print(df)
0 1 2 3 4 5
0 5 0 4 2 1 4
0 9 8 9 1 0 3
1 4 3 5 1 0 1
1 6 7 8 3 4 1
2 3 4 2 1 3 0
[5 rows x 6 columns]
In [23]:
# build a mask matrix: 0 where the column label equals the index value, 1 elsewhere; a vectorized way of doing "if True then 0, else 1"
print(df * np.where(df.columns.values == df.index.values[..., np.newaxis], 0, 1))
0 1 2 3 4 5
0 0 0 4 2 1 4
0 0 8 9 1 0 3
1 4 0 5 1 0 1
1 6 0 8 3 4 1
2 3 4 0 1 3 0
[5 rows x 6 columns]
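The answer above uses the com values as the index. If com is kept as a regular column, as in the question, the same broadcasting comparison can be combined with DataFrame.mask. A minimal sketch, assuming the remaining column labels are the integers 0 to 5 (value_cols and mask are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "com": [0, 0, 1, 1, 2],
        0: [5, 9, 4, 6, 3],
        1: [0, 8, 3, 7, 4],
        2: [4, 9, 5, 8, 2],
        3: [2, 1, 1, 3, 1],
        4: [1, 0, 0, 4, 3],
        5: [4, 3, 1, 1, 0],
    },
    index=["AAA", "ABC", "ADE", "BCD", "BCF"],
)

# Every column except "com" holds values that may need to be zeroed.
value_cols = [c for c in df.columns if c != "com"]

# Compare com (as a column vector) with the column labels (as a row vector);
# broadcasting yields a boolean matrix that is True where com == label.
mask = df["com"].to_numpy()[:, None] == np.array(value_cols)

# Replace the masked cells with 0 and leave everything else unchanged.
df[value_cols] = df[value_cols].mask(mask, 0)
print(df)
```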
I think this should work.
for line in range(len(matrix)):
    # the first entry of each row is the column number to zero out
    matrix[line][matrix[line][0] + 1] = 0
NOTE
Depending on your matrix setup you may not need the +1
Basically it takes the first digit of each line in the matrix and uses that as the index of the value to change to 0
i.e. if the row was
c 0 1 2 3 4 5
AAA 4 3 2 3 9 5 9,
it would change the 5 below the number 4 to 0
c 0 1 2 3 4 5
AAA 4 3 2 3 9 0 9
