I have an existing dataframe like this
>>> print(dataframe)
sid
30 11
56 5
73 25
78 2
132 1
..
8531 25
8616 2
9049 1
9125 6
9316 11
Name: name, Length: 87, dtype: int64
I want to add a row like {'sid': 2, '': 100} to it but when I try this
df = pandas.DataFrame({'sid': [2], '': [100]})
df = df.set_index('sid')
dataframe = dataframe.append(df)
print(dataframe)
I end up with
sid
30 11.0 NaN
56 5.0 NaN
73 25.0 NaN
78 2.0 NaN
132 1.0 NaN
... ... ...
8616 2.0 NaN
9049 1.0 NaN
9125 6.0 NaN
9316 11.0 NaN
2 NaN 100.0
I'm hoping for something more like
sid
2 100
30 11
56 5
73 25
78 2
132 1
..
8531 25
8616 2
9049 1
9125 6
9316 11
Any idea how I can achieve that?
The way to do this was
dataframe.loc[2] = 100
Thanks anky!
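For context, assigning through .loc adds the new label at the end of the Series, so a sort_index puts sid 2 first, as in the desired output. A minimal sketch, assuming dataframe is the sid-indexed integer Series shown above (shortened here):
import pandas as pd

dataframe = pd.Series({30: 11, 56: 5, 73: 25}, name='name')
dataframe.index.name = 'sid'

dataframe.loc[2] = 100              # adds (or overwrites) the row with index label 2
dataframe = dataframe.sort_index()  # .loc appends at the end; sort to put sid 2 first
print(dataframe)
# sid
# 2     100
# 30     11
# 56      5
# 73     25
# Name: name, dtype: int64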
The reason for the problem above is that when you appended the two DataFrames, 'sid' was not set as the index on both of them, so the two DataFrames had different structures at the moment you appended them. Make sure to set the same index on both dataframes before you append them.
data = [[30, 11], [56, 5], [73, 25]]  # test dataframe
dataframe = pd.DataFrame(data, columns=['sid', ''])
dataframe = dataframe.set_index('sid')
print(dataframe)
You get,
sid
30 11
56 5
73 25
Create and set the index of df,
df = pd.DataFrame({'sid' : [2], '' : [100]})
df = df.set_index('sid')
You get,
sid
2 100
Then append them,
dataframe = df.append(dataframe)
print(dataframe)
You will get the desired outcome,
sid
2 100
30 11
56 5
73 25
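A side note for readers on newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the same append is now written with pd.concat. A minimal sketch using the test frames from above:
import pandas as pd

dataframe = pd.DataFrame([[30, 11], [56, 5], [73, 25]],
                         columns=['sid', '']).set_index('sid')
df = pd.DataFrame({'sid': [2], '': [100]}).set_index('sid')

dataframe = pd.concat([df, dataframe])  # both frames share the 'sid' index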
Related
We have data from a device that measures multiple parts, and it outputs multiple measurements for each part into a CSV file. We read the CSV file into a dataframe with a structure such as this:
PartNo 12
Meas1 45
Meas2 23
!END
PartNo 13
Meas1 63
Meas2 73
!END
PartNo 12
Meas1 82
Meas2 84
!END
The "!END" flag indicates where the data from one part ends, and the next part starts.
We would like to reshape the data so it looks like:
PartNo Meas1 Meas2
12 45 23
13 63 73
12 82 84
(Note that a part could appear more than once - so there is no field that is guaranteed to be unique across all records.)
A pivot produces:
0 !END Meas1 Meas2 PartNo
0 NaN NaN NaN 12.0
1 NaN 45.0 NaN NaN
2 NaN NaN 23.0 NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN 13.0
5 NaN 63.0 NaN NaN
6 NaN NaN 73.0 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN 12.0
9 NaN 82.0 NaN NaN
10 NaN NaN 84.0 NaN
11 NaN NaN NaN NaN
How do I squeeze these rows down to group by PartNo?
A transpose produces:
0 1 2 3 4 5 6 7 8 9 10 11
0 PartNo Meas1 Meas2 !END PartNo Meas1 Meas2 !END PartNo Meas1 Meas2 !END
1 12 45 23 NaN 13 63 73 NaN 12 82 84 NaN
How could I reset the row every 4th item?
I could create a new index column in the original dataframe and then iterate through the rows, incrementing the index for every row with !END (and then use the index to group the data),
but it seems there ought to be a more elegant shape-shifting function for this case, or maybe an argument to pivot or transpose that would handle it. I am a Python beginner.
Here is the full code:
import pandas as pd
from io import StringIO
tdata = (
'PartNo, 12\n'
'Meas1, 45\n'
'Meas2, 23\n'
'!END\n'
'PartNo, 13\n'
'Meas1, 63\n'
'Meas2, 73\n'
'!END\n'
'PartNo, 12\n'
'Meas1, 82\n'
'Meas2, 84\n'
'!END\n')
tdf = pd.read_csv(StringIO(tdata), header=None)
print(tdf)
print(tdf.pivot(index=None, columns=0, values=1))
print(tdf.T)
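For reference, the group-index idea mentioned in the question can be vectorized instead of iterated: a counter that increments at every '!END' marker falls out of a cumulative sum. A minimal sketch building on the tdf read above ('group' is just an illustrative label):
grp = tdf[0].eq('!END').cumsum()   # 0,0,0,1,1,1,1,2,... bumps at each !END row
records = tdf[tdf[0] != '!END']    # drop the marker rows
out = records.assign(group=grp).pivot(index='group', columns=0, values=1)
print(out[['PartNo', 'Meas1', 'Meas2']])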
#having dataframe x:
>>> x = pd.DataFrame([['PartNo',12],['Meas1',45],['Meas2',23],['!END',''],['PartNo',13],['Meas1',63],['Meas2',73],['!END',''],['PartNo',12],['Meas1',82],['Meas2',84],['!END','']])
>>> x
0 1
0 PartNo 12
1 Meas1 45
2 Meas2 23
3 !END
4 PartNo 13
5 Meas1 63
6 Meas2 73
7 !END
8 PartNo 12
9 Meas1 82
10 Meas2 84
11 !END
# Group by the first column and aggregate the values into lists. Column 1 then holds the lists you want; converting each list to a Series builds a dataframe, which you then just need to transpose:
>>> df = x.groupby(0).agg(lambda x: list(x))[1].apply(lambda x: pd.Series(x)).transpose()
>>> df[['PartNo','Meas1','Meas2']]
0 PartNo Meas1 Meas2
0 12 45 23
1 13 63 73
2 12 82 84
The file is not a CSV file, so parsing it as CSV cannot produce correct output. It is not a well-known format, so I would use a custom parser:
import pandas as pd

with open(filename) as fd:
    data = []
    row = None
    for line in fd:
        line = line.strip()
        if line == '!END':
            row = None          # current record is complete
        else:
            k, v = line.split(None, 1)
            if row is None:     # first field of a new record
                row = {k: v}
                data.append(row)
            else:
                row[k] = v

# column order from a set is arbitrary; pass a list for a fixed order
header = set(i for row in data for i in row.keys())
df = pd.DataFrame(data, columns=header)
Based on the information provided, I think you should be able to achieve what you want with this approach:
df = df[df[0] != '!END']
out = df.groupby(0).agg(list).T.apply(lambda x: x.explode(), axis=0)
output:
0 Meas1 Meas2 PartNo
1 45 23 12
1 63 73 13
1 82 84 12
This essentially groups the original df by the PartNo, Meas1 and Meas2 keys and makes a list for each; it then explodes each list into a pd.Series, giving a column for each key, with as many rows as there are entries per key (they should all be the same).
Here is how I would do it. I would parse the file as a plain text file and build a record from the fields I need, using the '!END' row as the signal that a record is complete, appending each finished record to a list and finally converting the list to a DataFrame.
import pandas as pd

filename = 'PartDetail.csv'
with open(filename, 'r') as file:
    LinesFromFile = file.readlines()

RowToWrite = []
for EachLine in LinesFromFile:
    ValuePosition = EachLine.find(" ") + 1
    CurrentAttrib = EachLine[0:ValuePosition - 1]
    if CurrentAttrib == 'PartNo':
        PartNo = EachLine[ValuePosition + 1:len(EachLine) - 1].strip()
    if CurrentAttrib == 'Meas1':
        Meas1 = EachLine[ValuePosition + 1:len(EachLine) - 1].strip()
    if CurrentAttrib == 'Meas2':
        Meas2 = EachLine[ValuePosition + 1:len(EachLine) - 1].strip()
    if EachLine[0:4] == '!END':
        RowToWrite.append([PartNo, Meas1, Meas2])

PartsDataDF = pd.DataFrame(RowToWrite, columns=['PartNo', 'Meas1', 'Meas2'])  # converting to DataFrame
This will give you a clean DataFrame, with one row per part.
Hope it helps.
I want to calculate the mean of columns a,b,c,d of the dataframe, BUT if one of the four values in a row differs by more than 20% from the mean (of those four values), the mean has to be set to NaN.
Calculating the mean of the 4 columns is easy, but I'm stuck defining the condition: if any value in the row falls outside the interval mean*0.8 <= value <= mean*1.2, then the mean should be set to NaN.
In the example, one or more of the values in rows ID 5 and ID 87 don't fit in the interval, and therefore the mean is set to NaN.
(NaN-values in the initial dataframe are ignored when calculating the mean and when applying the 20%-condition to the calculated mean)
So I'm trying to calculate the mean only for the data rows with no 'outliers'.
Initial df:
ID a b c d
2 31 32 31 31
5 33 52 159 2
7 51 NaN 52 51
87 30 52 421 2
90 10 11 10 11
102 41 42 NaN 42
Desired df:
ID a b c d mean
2 31 32 31 31 31.25
5 33 52 159 2 NaN
7 51 NaN 52 51 51.33
87 30 52 421 2 NaN
90 10 11 10 11 10.50
102 41 42 NaN 42 41.67
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [2,5,7,87,90,102],
"a": [31,33,51,30,10,41],
"b": [32,52,np.nan,52,11,42],
"c": [31,159,52,421,10,np.nan],
"d": [31,2,51,2,11,42]})
print(df)
a = df.loc[:, ['a','b','c','d']]
df['mean'] = (a.iloc[:,0:]).mean(1)
print(df)
b = df.mean.values[:,None]*0.8 < a.values[:,:] < df.mean.values[:,None]*1.2  # this is where I'm stuck: chained comparisons don't work on arrays
print(b)
...
Try this:
# extract related information
s = df.iloc[:,1:]
# calculate mean
mean = s.mean(1)
# where condition is violated
mask = s.lt(mean*.8, axis=0) | s.gt(mean*1.2, axis=0)
# NaN out the mean wherever the mask is True anywhere in the row
df['mean'] = mean.mask(mask.any(1))
Output:
ID a b c d mean
0 2 31 32.0 31.0 31 31.250000
1 5 33 52.0 159.0 2 NaN
2 7 51 NaN 52.0 51 51.333333
3 87 30 52.0 421.0 2 NaN
4 90 10 11.0 10.0 11 10.500000
5 102 41 42.0 NaN 42 41.666667
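For what it's worth, the 20% band can also be phrased as a relative deviation from the row mean. A small equivalent variant, reusing s and mean from the snippet above (and assuming positive means, as in the sample data); NaNs stay excluded because comparisons involving NaN evaluate to False:
rel_mask = s.div(mean, axis=0).sub(1).abs().gt(0.2)  # |value/mean - 1| > 0.2
df['mean'] = mean.mask(rel_mask.any(axis=1))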
I have a data frame with 2 columns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
A B
0 11 10
1 61 30
2 24 54
3 47 52
4 72 42
... ... ...
95 61 2
96 67 41
97 95 30
98 29 66
99 49 22
100 rows × 2 columns
Now I want to create a third column, which is a rolling window max of col 'A', BUT
the max has to be lower than the corresponding value in col 'B'. In other words, I want the highest of the 4 values (using a window size of 4) in column 'A' that is still smaller than the value in col 'B'.
So for example in row
3 47 52
the new value I am looking for is not 61 but 47, because it is the highest of the 4 values that is not higher than 52
pseudo code
df['C'] = df['A'].rolling(window=4).max() where < df['B']
You can use concat + shift to create a wide DataFrame with the previous values, which makes complicated rolling calculations a bit easier.
Sample Data
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))
Code
N = 4
# End slice ensures the same default min_periods behavior as `.rolling`
df1 = pd.concat([df['A'].shift(i).rename(i) for i in range(N)], axis=1).iloc[N-1:]
# Remove values larger than B, then find the max of remaining.
df['C'] = df1.where(df1.lt(df.B, axis=0)).max(1)
print(df.head(15))
A B C
0 51 92 NaN # Missing b/c min_periods
1 14 71 NaN # Missing b/c min_periods
2 60 20 NaN # Missing b/c min_periods
3 82 86 82.0
4 74 74 60.0
5 87 99 87.0
6 23 2 NaN # Missing b/c 82, 74, 87, 23 all > 2
7 21 52 23.0 # Max of 21, 23, 87, 74 which is < 52
8 1 87 23.0
9 29 37 29.0
10 1 63 29.0
11 59 20 1.0
12 32 75 59.0
13 57 21 1.0
14 88 48 32.0
You can use a custom function to .apply to the rolling window. In this case, you can use a default argument to pass in the B column.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))

def rollup(a, B=df.B):
    # `a` is the current window of column A as a Series, so its last
    # index label locates the corresponding value in column B.
    ix = a.index.max()
    b = B[ix]
    return a[a < b].max()
df['C'] = df.A.rolling(4).apply(rollup)
df
# returns:
A B C
0 8 17 NaN
1 23 84 NaN
2 75 84 NaN
3 86 24 23.0
4 52 83 75.0
.. .. .. ...
95 38 22 NaN
96 53 48 38.0
97 45 4 NaN
98 3 92 53.0
99 91 86 53.0
The NaN values occur when no number in the window of A is less than B or at the start of the series when the window is too big for the first few rows.
You can use where to replace values that don't fulfill the condition with np.nan and then use rolling(window=4, min_periods=1):
In [37]: df['C'] = df['A'].where(df['A'] < df['B'], np.nan).rolling(window=4, min_periods=1).max()
In [38]: df
Out[38]:
A B C
0 0 1 0.0
1 1 2 1.0
2 2 3 2.0
3 10 4 2.0
4 4 5 4.0
5 5 6 5.0
6 10 7 5.0
7 10 8 5.0
8 10 9 5.0
9 10 10 NaN
I am trying to add an empty column after the 3rd column of my data frame, which contains 5 columns. Example:
Fname,Lname,city,state,zip
mike,smith,new york,ny,11101
This is what I have and below I am going to show what I want it to look like.
Fname,Lname,new column,city,state,zip
mike,smith,,new york,ny,11101
I don't want to populate that column with data; all I want is to add the extra column in the header, so that each data row has a blank field, i.e. ',,'.
I've seen examples where a new column is added to the end of a data frame, but not at a specific position.
You should use
df.insert(loc, column, value)
with loc being the insertion index, column the column name, and value its value(s).
For an empty column:
df.insert(loc=2, column='new col', value=['' for i in range(df.shape[0])])
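For instance, a quick check on the question's sample row (using 'new column' as the header the question asked for):
import pandas as pd

df = pd.DataFrame({'Fname': ['mike'], 'Lname': ['smith'],
                   'city': ['new york'], 'state': ['ny'], 'zip': [11101]})
df.insert(loc=2, column='new column', value='')  # a scalar value is broadcast to every row
print(df.to_csv(index=False))
# Fname,Lname,new column,city,state,zip
# mike,smith,,new york,ny,11101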
Use reindex or column filtering
df = pd.DataFrame(np.arange(50).reshape(10,-1), columns=[*'ABCDE'])
df['z']= np.nan
df[['A','z','B','C','D','E']]
OR
df.reindex(['A','z','B','C','D','E'], axis=1)
Output:
A z B C D E
0 0 NaN 1 2 3 4
1 5 NaN 6 7 8 9
2 10 NaN 11 12 13 14
3 15 NaN 16 17 18 19
4 20 NaN 21 22 23 24
5 25 NaN 26 27 28 29
6 30 NaN 31 32 33 34
7 35 NaN 36 37 38 39
8 40 NaN 41 42 43 44
9 45 NaN 46 47 48 49
You can simply go for df.insert()
import pandas as pd
data = {'Fname': ['mike'],
'Lname': ['smith'],
'city': ['new york'],
'state': ['ny'],
'zip': [11101]}
df = pd.DataFrame(data)
df.insert(1, "Address", '', True)
print(df)
Output:
Fname Address Lname city state zip
0 mike smith new york ny 11101
I'd like to add a totals row at the top of my pivot, but when I try to concat I get an error.
>>> table
Weight
Vcountry 1 2 3 4 5 6 7
V20001
1 86 NaN NaN NaN NaN NaN 92
2 41 NaN 71 40 50 51 49
3 NaN 61 60 61 60 25 62
4 51 NaN NaN NaN NaN NaN NaN
5 26 26 20 41 25 23 NaN
[5 rows x 7 columns]
That's the pivot table.
>>> totals_frame
Vcountry 1 2 3 4 5 6 7
totalCount 204 87 151 142 135 99 203
[1 rows x 7 columns]
This is the totals row I'd like to join.
>>> pc = [totals_frame, table]
>>> concat(pc)
Here is the error output:
reindex_items
copy_if_needed=True)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 2887, in reindex
target = MultiIndex.from_tuples(target)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 2486, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "inference.pyx", line 915, in pandas.lib.tuples_to_object_array (pandas\lib.c:43656)
TypeError: object of type 'long' has no len()
Here's a possible way: instead of using pd.concat use pd.DataFrame.append. There's a bit of fiddling around with the index to do, but it's still quite neat I think:
# Just setting up the dataframe:
df = pd.DataFrame({'country':['A','A','A','B','B','B'],
'weight':[1,2,3,1,2,3],
'value':[10,20,30,15,25,35]})
df = df.set_index(['country','weight']).unstack('weight')
# A bit of messing about to get the index right:
index = df.index.values.tolist()
index.append('Totals')
# Here's where the magic happens:
df = df.append(df.sum(), ignore_index=True)
df.index = index
which gives:
value
weight 1 2 3
A 10 20 30
B 15 25 35
Totals 25 45 65
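On current pandas, DataFrame.append no longer exists (deprecated in 1.4, removed in 2.0), so the same totals row is attached with pd.concat instead. A minimal sketch reusing the toy frame above:
import pandas as pd

df = pd.DataFrame({'country': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'weight': [1, 2, 3, 1, 2, 3],
                   'value': [10, 20, 30, 15, 25, 35]})
df = df.set_index(['country', 'weight']).unstack('weight')

totals = df.sum().to_frame('Totals').T  # one-row frame labelled 'Totals'
df = pd.concat([totals, df])            # totals first puts the row on top
print(df)
#         value
# weight      1   2   3
# Totals     25  45  65
# A          10  20  30
# B          15  25  35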