Python: How to transform column data in a CSV to row data

I have a csv file that looks like this:
Signal Channel
-60 1
-40 6
-70 11
-80 3
-80 4
-66 1
-60 11
-50 6
I want to create a new csv file using those data in matrix form:
channel 1 | channel 2 | channel 3 | channel 4 | channel 5 | channel 6 | channel 7 | channel 8 | channel 9 | channel 10 | channel 11
-60 | | -80 | -80 | | -40 | | | | | -70
-66 | | | | | -50 | | | | | -60
But I don't know how to do this.

I think you can manage with this; you just have to pass the args you want to the to_csv function to make it display the way you want:
import pandas as pd

data = {"Signal": [-60, -40, -70, -80, -80, -66, -60, -50],
        "Channel": [1, 6, 11, 3, 4, 1, 11, 6]}
df = pd.DataFrame(data)
# Number each reading within its channel so repeated channels land on separate rows.
df['count'] = df.groupby('Channel')['Channel'].cumcount()
pivot = pd.pivot_table(df, values="Signal", columns="Channel", index='count')
pivot = pivot.add_prefix('Channel_')
pivot.to_csv("test.csv", index=False)
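If the data lives in a file rather than an inline dict, the same pipeline applies. A minimal sketch, assuming a hypothetical file name input.csv and the whitespace-separated layout shown in the question:
import pandas as pd

# 'input.csv' is a placeholder name; sep=r'\s+' matches the space-separated
# sample in the question (drop it if the file is actually comma-separated).
df = pd.read_csv('input.csv', sep=r'\s+')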

You can use df.pivot() from pandas (see here). First, cumcount is used to determine the index location for the pivoting.
from io import StringIO
import pandas as pd
csv = """
Signal,Channel
-60,1
-40,6
-70,11
-80,3
-80,4
-66,1
-60,11
-50,6"""
df = pd.read_csv(StringIO(csv))
df['index'] = df.groupby('Channel').cumcount()
df.pivot(index='index', columns="Channel", values="Signal")
gives you
Channel 1 3 4 6 11
index
0 -60.0 -80.0 -80.0 -40.0 -70.0
1 -66.0 NaN NaN -50.0 -60.0
This answer is adapted from #unutbu's answer here
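If you also need the exact layout from the question (columns channel 1 through channel 11, with blank cells for channels that have no reading), you can reindex the pivoted frame before writing it out. A sketch building on the pivot above, assuming channels always fall in the range 1-11:
pivoted = df.pivot(index='index', columns='Channel', values='Signal')
# Insert the channels that never appear (2, 5, 7, 8, 9, 10) as all-NaN columns.
pivoted = pivoted.reindex(columns=range(1, 12))
pivoted.columns = [f'channel {c}' for c in pivoted.columns]
pivoted.to_csv('matrix.csv', index=False)  # NaN cells come out as blanks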

Related

How to obtain counts and sums for pairs of values in each row of Pandas DataFrame

Problem:
I have a DataFrame like so:
import pandas as pd
df = pd.DataFrame({
    "name": ["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
    "category": ["a","b","c","b","a","b","c","c","a","c"],
    "amount": [100,200,13,23,40,2,43,92,83,1]
})
name | category | amount
----------------------------
0 john | a | 100
1 jim | b | 200
2 eric | c | 13
3 jim | b | 23
4 john | a | 40
5 jim | b | 2
6 jim | c | 43
7 eric | c | 92
8 eric | a | 83
9 john | c | 1
I would like to add two new columns: first, the total amount for the relevant category for the name of the row (e.g. the value in row 0 would be 140, because john has a total of 100 + 40 in the a category); second, the count of the name and category combinations being summed in the first new column (e.g. the row 0 value would be 2).
Desired output:
The output I'm looking for here looks like this:
name | category | amount | sum_for_category | count_for_category
------------------------------------------------------------------------
0 john | a | 100 | 140 | 2
1 jim | b | 200 | 225 | 3
2 eric | c | 13 | 105 | 2
3 jim | b | 23 | 225 | 3
4 john | a | 40 | 140 | 2
5 jim | b | 2 | 225 | 3
6 jim | c | 43 | 43 | 1
7 eric | c | 92 | 105 | 2
8 eric | a | 83 | 83 | 1
9 john | c | 1 | 1 | 1
I don't want to group the data by the features because I want to keep the same number of rows. I just want to tag on the desired value for each row.
Best I could do:
I can't find a good way to do this. The best I've been able to come up with is the following:
names = df["name"].unique()
categories = df["category"].unique()
sum_for_category = {i: {
    j: df.loc[(df["name"] == i) & (df["category"] == j)]["amount"].sum()
    for j in categories
} for i in names}
df["sum_for_category"] = df.apply(lambda x: sum_for_category[x["name"]][x["category"]], axis=1)
count_for_category = {i: {
    j: df.loc[(df["name"] == i) & (df["category"] == j)]["amount"].count()
    for j in categories
} for i in names}
df["count_for_category"] = df.apply(lambda x: count_for_category[x["name"]][x["category"]], axis=1)
But this is extremely clunky and slow; far too slow to be viable on my actual dataset (roughly 700,000 rows x 10 columns). I'm sure there's a better and faster way to do this... Many thanks in advance.
You need two groupby.transform calls:
g = df.groupby(['name', 'category'])['amount']
df['sum_for_category'] = g.transform('sum')
df['count_for_category'] = g.transform('size')
output:
name category amount sum_for_category count_for_category
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
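This works because transform returns a result aligned to the original index, one value per input row, which is what lets you assign it straight back without shrinking the frame. A quick sketch on the same data, grouping by name alone:
# transform('sum') broadcasts each group's total back onto every member row.
print(df.groupby('name')['amount'].transform('sum').head(3))
# 0    141    <- john's total (100 + 40 + 1)
# 1    268    <- jim's total (200 + 23 + 2 + 43)
# 2    188    <- eric's total (13 + 92 + 83)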
Another possible solution:
g = df.groupby(['name', 'category'])['amount'].agg(['sum', 'count']).reset_index()
df.merge(g, on=['name', 'category'], how='left')
Output:
name category amount sum count
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
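If you are on pandas 0.25 or later, named aggregation gets you the exact column names from the desired output in one step; a sketch:
# agg with keyword arguments names the result columns directly.
g = (df.groupby(['name', 'category'], as_index=False)
       .agg(sum_for_category=('amount', 'sum'),
            count_for_category=('amount', 'size')))
out = df.merge(g, on=['name', 'category'], how='left')
print(out)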
Hi there,
Here is some simple, easy-to-understand code; just run it and you will get what you want.
import pandas as pd

df = pd.DataFrame({
    "name": ["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
    "category": ["a","b","c","b","a","b","c","c","a","c"],
    "amount": [100,200,13,23,40,2,43,92,83,1]
})
df_Count = df.groupby(['name','category']).count().reset_index().rename({'amount': 'Count_For_Category'}, axis=1)
df_Sum = df.groupby(['name','category']).sum().reset_index().rename({'amount': 'Sum_For_Category'}, axis=1)
df_v2 = pd.merge(df, df_Count[['name','category','Count_For_Category']], on=['name','category'], how='left')
df_v2 = pd.merge(df_v2, df_Sum[['name','category','Sum_For_Category']], on=['name','category'], how='left')
df_v2
Thanks,
Leon

Calculating new columns based on other columns' values in a pandas dataframe

I want to create a new column based on the values of other columns in a pandas dataframe. My data is about a truck that moves back and forth from a loading to a dumping location. I want to calculate the distance from the current road segment to the end of the road. An example of the data is shown below:
State | segment length |
-----------------------------
Loaded | 20 |
Loaded | 10 |
Loaded | 10 |
Empty | 15 |
Empty | 10 |
Empty | 10 |
Loaded | 30 |
Loaded | 20 |
Loaded | 10 |
So, the end of a road is the record just before the State changes. Hence I want to calculate the distance from the end of the road. The final dataframe will be:
State | segment length | Distance to end
Loaded | 20 | 40
Loaded | 10 | 20
Loaded | 10 | 10
Empty | 15 | 35
Empty | 10 | 20
Empty | 10 | 10
Loaded | 30 | 60
Loaded | 20 | 30
Loaded | 10 | 10
Can anyone help?
Thank you in advance
Use GroupBy.cumsum with DataFrame.iloc to reverse the row order, and a helper Series built with shift and cumsum to identify each run of consecutive equal states:
g = df['State'].ne(df['State'].shift()).cumsum()
df['Distance to end'] = df.iloc[::-1].groupby(g)['segment length'].cumsum()
print(df)
State segment length Distance to end
0 Loaded 20 40
1 Loaded 10 20
2 Loaded 10 10
3 Empty 15 35
4 Empty 10 20
5 Empty 10 10
6 Loaded 30 60
7 Loaded 20 30
8 Loaded 10 10
Detail:
print(g)
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
Name: State, dtype: int32
Another option is a single chained expression: tag consecutive runs, compute each run's total, then subtract the distance already travelled within the run:
df['Distance to end'] = (
    df.assign(i=df.State.ne(df.State.shift()).cumsum())
      .assign(s=lambda x: x.groupby(by='i')['segment length'].transform('sum'))
      .groupby(by='i')
      .apply(lambda x: x.s.sub(x['segment length'].shift().cumsum().fillna(0)))
      .values
)
State segment length Distance to end
0 Loaded 20 40.0
1 Loaded 10 20.0
2 Loaded 10 10.0
3 Empty 15 35.0
4 Empty 10 20.0
5 Empty 10 10.0
6 Loaded 30 60.0
7 Loaded 20 30.0
8 Loaded 10 10.0
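For reference, a self-contained version of the first approach, with the frame from the question rebuilt inline:
import pandas as pd

df = pd.DataFrame({
    'State': ['Loaded']*3 + ['Empty']*3 + ['Loaded']*3,
    'segment length': [20, 10, 10, 15, 10, 10, 30, 20, 10],
})
# A new group starts whenever State differs from the previous row.
g = df['State'].ne(df['State'].shift()).cumsum()
# Reversing the rows makes cumsum run from the end of each road back to its start.
df['Distance to end'] = df.iloc[::-1].groupby(g)['segment length'].cumsum()
print(df)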

Histogram with custom y frequency python

I am trying to plot the following data
+-----------+------+------+
| Duration | Code | Seq. |
+-----------+------+------+
| 116.15 | 65 | 1 |
| 120.45 | 65 | 1 |
| 118.92 | 65 | 1 |
| 7.02 | 66 | 1 |
| 73.93 | 66 | 2 |
| 117.53 | 66 | 1 |
| 4.4 | 66 | 2 |
| 111.03 | 66 | 1 |
| 4.35 | 66 | 1 |
+-----------+------+------+
I have my code as:
x1 = df.loc[df.Code==65, 'Duration']
x2 = df.loc[df.Code==66, 'Duration']
kwargs = dict(alpha=0.5, bins=10)
plt.hist(x1, **kwargs, color='k', label='Code 65')
plt.hist(x2, **kwargs, color='g', label='Code 66')
What I ideally want on the y axis is the number of Seq. corresponding to the different Durations on the x axis. But right now I only get the count of the Durations on the y axis. How do I correct this?
You might bin the 'x' values using pandas and then use a bar chart instead.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Duration':[116.15, 120.45,118.92,7.02,73.93, 117.53, 4.4, 111.03, 4.35]})
df['Code'] = [65,65,65,66,66,66,66,66,66]
df['Seq.'] = [1,1,1,1,2,1,2,1,1]
df
Duration Code Seq.
0 116.15 65 1
1 120.45 65 1
2 118.92 65 1
3 7.02 66 1
4 73.93 66 2
5 117.53 66 1
6 4.40 66 2
7 111.03 66 1
8 4.35 66 1
df['bin'] = pd.cut(df['Duration'],10, labels=False)
df
Duration Code Seq. bin
0 116.15 65 1 9
1 120.45 65 1 9
2 118.92 65 1 9
3 7.02 66 1 0
4 73.93 66 2 5
5 117.53 66 1 9
6 4.40 66 2 0
7 111.03 66 1 9
8 4.35 66 1 0
x1 = df.loc[df.Code==65, 'bin']
x2 = df.loc[df.Code==66, 'bin']
y1 = df.loc[df.Code==65, 'Seq.']
y2 = df.loc[df.Code==66, 'Seq.']
plt.bar(x1, y1)
plt.bar(x2, y2)
plt.show()
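As an alternative sketch, matplotlib's hist accepts a weights argument, so you can let each Duration contribute its Seq. value to the bin height instead of pre-binning by hand:
# Each Duration counts Seq. times toward its bin, rather than once.
d1 = df.loc[df.Code == 65, 'Duration']
d2 = df.loc[df.Code == 66, 'Duration']
w1 = df.loc[df.Code == 65, 'Seq.']
w2 = df.loc[df.Code == 66, 'Seq.']
plt.hist(d1, bins=10, weights=w1, alpha=0.5, color='k', label='Code 65')
plt.hist(d2, bins=10, weights=w2, alpha=0.5, color='g', label='Code 66')
plt.legend()
plt.show()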

Plots shifting in heatmaps in Seaborn Facetgrid

Sorry in advance for the number of images, but they help demonstrate the issue.
I have built a dataframe which contains film thickness measurements, for a number of substrates, for a number of layers, as function of coordinates:
| | Sub | Result | Layer | Row | Col |
|----|-----|--------|-------|-----|-----|
| 0 | 1 | 2.95 | 3 - H | 0 | 72 |
| 1 | 1 | 2.97 | 3 - V | 0 | 72 |
| 2 | 1 | 0.96 | 1 - H | 0 | 72 |
| 3 | 1 | 3.03 | 3 - H | -42 | 48 |
| 4 | 1 | 3.04 | 3 - V | -42 | 48 |
| 5 | 1 | 1.06 | 1 - H | -42 | 48 |
| 6 | 1 | 3.06 | 3 - H | 42 | 48 |
| 7 | 1 | 3.09 | 3 - V | 42 | 48 |
| 8 | 1 | 1.38 | 1 - H | 42 | 48 |
| 9 | 1 | 3.05 | 3 - H | -21 | 24 |
| 10 | 1 | 3.08 | 3 - V | -21 | 24 |
| 11 | 1 | 1.07 | 1 - H | -21 | 24 |
| 12 | 1 | 3.06 | 3 - H | 21 | 24 |
| 13 | 1 | 3.09 | 3 - V | 21 | 24 |
| 14 | 1 | 1.05 | 1 - H | 21 | 24 |
| 15 | 1 | 3.01 | 3 - H | -63 | 0 |
| 16 | 1 | 3.02 | 3 - V | -63 | 0 |
and this continues for >10 subs (per batch), and 13 sites per sub, and for 3 layers - this df is a composite.
I am attempting to present the data as a facetgrid of heatmaps (adapting code from How to make heatmap square in Seaborn FacetGrid - thanks!)
I can plot a subset of the df quite happily:
spam = df.loc[(df.Sub == 6) & (df.Layer == '3 - H')]
spam_p = spam.pivot(index='Row', columns='Col', values='Result')
sns.heatmap(spam_p, cmap="plasma")
BUT - there are some missing results, where the layer measurement errors out (returning '10000'), so I've replaced these with NaNs:
df['Result'] = df['Result'].replace(10000, np.nan)
To plot a facetgrid to show all subs/layers, I've written the following code:
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1], values=args[2])
    sns.heatmap(d, **kwargs)
fig = sns.FacetGrid(df, row='Sub', col='Layer', height=5, aspect=1)
fig.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False,
                  cmap="plasma", annot=True, annot_kws={"size": 20})
which yields a grid that has automatically adjusted its axes to not show any positions where there is a NaN.
I have tried masking (see https://github.com/mwaskom/seaborn/issues/375) but just errors out with Inconsistent shape between the condition and the input (got (237, 15) and (7, 7)).
And the result of this is that, when not using the cropped-down dataset (i.e. df instead of spam), the code generates the following FacetGrid:
Plots featuring missing values at extreme (edge) coordinate positions shift within the axes - here all apparently to the upper left. Sub #5, layer 3-H should instead show blanks in the places where there are NaNs.
Why is the facetgrid shifting the entire plot up and/or left? The alternative is dynamically generating subplots based on a sub/layer-count (ugh!).
Any help very gratefully received.
Full dataset for 2 layers of sub 5:
Sub Result Layer Row Col
0 5 2.987 3 - H 0 72
1 5 0.001 1 - H 0 72
2 5 1.184 3 - H -42 48
3 5 1.023 1 - H -42 48
4 5 3.045 3 - H 42 48
5 5 0.282 1 - H 42 48
6 5 3.083 3 - H -21 24
7 5 0.34 1 - H -21 24
8 5 3.07 3 - H 21 24
9 5 0.41 1 - H 21 24
10 5 NaN 3 - H -63 0
11 5 NaN 1 - H -63 0
12 5 3.086 3 - H 0 0
13 5 0.309 1 - H 0 0
14 5 0.179 3 - H 63 0
15 5 0.455 1 - H 63 0
16 5 3.067 3 - H -21 -24
17 5 0.136 1 - H -21 -24
18 5 1.907 3 - H 21 -24
19 5 1.018 1 - H 21 -24
20 5 NaN 3 - H -42 -48
21 5 NaN 1 - H -42 -48
22 5 NaN 3 - H 42 -48
23 5 NaN 1 - H 42 -48
24 5 NaN 3 - H 0 -72
25 5 NaN 1 - H 0 -72
You may create a list of unique column and row labels and reindex the pivot table with them.
cols = df["Col"].unique()
rows = df["Row"].unique()
pivot = data.pivot(...).reindex(columns=cols, index=rows)
as seen in this answer.
Some complete code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
r = np.repeat([0, -2, 2, -1, 1, -3], 2)
row = np.concatenate((r, [0]*2, -r[::-1]))
c = np.array([72]*2 + [48]*4 + [24]*4 + [0]*3)
col = np.concatenate((c, -c[::-1]))
df = pd.DataFrame({"Result": np.random.rand(26),
                   "Layer": list("AB")*13,
                   "Row": row, "Col": col})
df1 = df.copy()
df1["Sub"] = [5]*len(df1)
df1.loc[10:11, "Result"] = np.nan  # .loc, not .at: .at only takes scalar labels
df1.loc[20:, "Result"] = np.nan
df2 = df.copy()
df2["Sub"] = [3]*len(df2)
df2.loc[0:2, "Result"] = np.nan
df = pd.concat([df1, df2])
cols = np.unique(df["Col"].values)
rows = np.unique(df["Row"].values)
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1], values=args[2])
    # Reindex against the full label sets so every facet keeps the same
    # grid, with NaNs where a position has no measurement.
    d = d.reindex(columns=cols, index=rows)
    print(d)
    sns.heatmap(d, **kwargs)

grid = sns.FacetGrid(df, row='Sub', col='Layer', height=3.5, aspect=1)
grid.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False,
                   cmap="plasma", annot=True)
plt.show()

Differences in one column based on differences in another, pandas

How can I perform the below manipulation with pandas?
I have this dataframe :
weight | Date | dateDay
43 | 09/03/2018 08:48:48 | 09/03/2018
30 | 10/03/2018 23:28:48 | 10/03/2018
45 | 12/03/2018 04:21:44 | 12/03/2018
25 | 17/03/2018 00:23:32 | 17/03/2018
35 | 18/03/2018 04:49:01 | 18/03/2018
39 | 19/03/2018 20:14:37 | 19/03/2018
I want this :
weight | Date | dateDay | Fun_Cum
43 | 09/03/2018 08:48:48 | 09/03/2018 | NULL
30 | 10/03/2018 23:28:48 | 10/03/2018 | -13
45 | 12/03/2018 04:21:44 | 12/03/2018 | NULL
25 | 17/03/2018 00:23:32 | 17/03/2018 | NULL
35 | 18/03/2018 04:49:01 | 18/03/2018 | 10
39 | 19/03/2018 20:14:37 | 19/03/2018 | 4
Pseudo code:
If Day does not follow Day-1 => Fun_Cum is NULL;
Else (weight day) - (weight day-1)
Thank you
This is one way, using pd.Series.diff and pd.Series.shift: take the difference between consecutive weights, then blank it out wherever the gap between consecutive dateDay values, checked via the pd.Series.dt.days attribute, is not exactly one day.
import numpy as np

df['Fun_Cum'] = df['weight'].diff()
df.loc[(df.dateDay - df.dateDay.shift()).dt.days != 1, 'Fun_Cum'] = np.nan
print(df)
weight Date dateDay Fun_Cum
0 43 2018-03-09 2018-03-09 NaN
1 30 2018-03-10 2018-03-10 -13.0
2 45 2018-03-12 2018-03-12 NaN
3 25 2018-03-17 2018-03-17 NaN
4 35 2018-03-18 2018-03-18 10.0
5 39 2018-03-19 2018-03-19 4.0
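A more compact variant of the same idea, as a sketch: Series.diff works on both the weights and the dates, and where() blanks out the gaps:
# Keep the weight diff only where the day gap is exactly one; NaN elsewhere.
df['Fun_Cum'] = df['weight'].diff().where(df['dateDay'].diff().dt.days == 1)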
# import pandas as pd
# from datetime import datetime
# to_datetime = lambda d: datetime.strptime(d, '%d/%m/%Y')
# df = pd.read_csv('d.csv', converters={'dateDay': to_datetime})
The part above is only needed if you are reading from a file; otherwise .shift() is all you need:
a = df
b = df.shift()
# where() keeps the weight difference only on consecutive days and leaves NaN
# elsewhere (multiplying by the boolean mask, as originally written, would
# produce 0 instead of NULL for the gaps).
df["Fun_Cum"] = (a.weight - b.weight).where((a.dateDay - b.dateDay).dt.days == 1)
