I have created a DataFrame full of zeros such as:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
...
n 0 0 0
I have a list containing names for the columns in unicode, such as:
list = [u'One', u'Two', u'Three']
The DataFrame of zeroes is known as a, and I am creating a new complete DataFrame with the list as column headers via:
final = pd.DataFrame(a, columns=[list])
However, the resulting DataFrame has column names that are no longer unicode (i.e. they do not show the u'' prefix).
I am wondering why this is happening. Thanks!
There is no reason for the unicode to be lost; you can check it with:
print(df.columns.tolist())
Please never use built-in names like list, type, or id as variable names, because doing so masks the built-in functions. Also, it is necessary to add .values to convert the data to a numpy array:
a = pd.DataFrame(0, columns=range(3), index=range(3))
print (a)
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
L = [u'One', u'Two', u'Three']
final = pd.DataFrame(a.values, columns=L)
print (final)
One Two Three
0 0 0 0
1 0 0 0
2 0 0 0
Without .values, the columns are not aligned and you get all NaNs:
final = pd.DataFrame(a, columns=L)
print (final)
One Two Three
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
I think the simplest approach is to use only the index of DataFrame a, if all values are 0:
L = [u'One', u'Two', u'Three']
final = pd.DataFrame(0, columns=L, index=a.index)
print (final)
One Two Three
0 0 0 0
1 0 0 0
2 0 0 0
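If the goal is only to relabel a's existing columns rather than build a new frame, renaming avoids copying or realigning the data entirely. A minimal sketch of that alternative:
import pandas as pd

a = pd.DataFrame(0, columns=range(3), index=range(3))
L = [u'One', u'Two', u'Three']

# rename maps old labels to new ones in order; the data itself is untouched
final = a.rename(columns=dict(zip(a.columns, L)))
print(final)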
Related
I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})
I am trying to replace the non-zero values with the character at index 5 of the column names (which is the numeric part), so the output should be:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following but it does not work,
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I replace both the values and the index with the column names, the code works, so I am not sure which part is wrong.
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1's and 0's, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean mask to place the strings where the condition holds, then where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
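As for why the original attempt in the question fails: when replace is given a Series, the replacement for each column is looked up by the column's label, so the Series must be indexed by df.columns rather than by the extracted characters (which is exactly why the version using df.columns works). A sketch of the corrected call, assuming the same replace-with-Series behavior the question itself demonstrates:
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})

# index the replacement Series by the column labels so each column finds its value
repl = pd.Series([int(c[5]) for c in df.columns], index=df.columns)
print(df.replace(1, repl))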
I am trying to carry out calculations on individual values that are stored in nested lists in a pandas DataFrame. My issue is how to access these individual values.
I am working from a data set available here: https://datadryad.org/stash/dataset/doi:10.5061/dryad.h505v
I have imported the .json file in a pandas DataFrame and the elastic constants are stored in the column 'elastic_tensor'.
import pandas as pd
df = pd.read_json(workdir+"ec.json")
df['elastic_tensor'].head()
Out:
0 [[311.33514638650246, 144.45092552856926, 126....
1 [[306.93357350984974, 88.02634955100905, 105.6...
2 [[569.5291276937579, 157.8517489654999, 157.85...
3 [[69.28798774976904, 34.7875015216915, 37.3877...
4 [[349.3767766177825, 186.67131003104407, 176.4...
Name: elastic_tensor, dtype: object
In order to access the individual values, what I have done is expand the nested lists once (as I could not find a way to use .extend() to flatten the nested list):
df1 = pd.DataFrame(df["elastic_tensor"].to_list(), columns=['c' + str(j) for j in range(1, 7)])
Note: I have named the columns c1..c6, as the elastic constants shall in the end be called cij, with i and j from 1 to 6.
Then I have expanded each of these columns in turn (as I could not find a way to do it in a loop):
dfc1 = pd.DataFrame(df1["c1"].to_list(), columns=['c1' + str(j) for j in range(1, 7)])
dfc2 = pd.DataFrame(df1["c2"].to_list(), columns=['c2' + str(j) for j in range(1, 7)])
dfc3 = pd.DataFrame(df1["c3"].to_list(), columns=['c3' + str(j) for j in range(1, 7)])
dfc4 = pd.DataFrame(df1["c4"].to_list(), columns=['c4' + str(j) for j in range(1, 7)])
dfc5 = pd.DataFrame(df1["c5"].to_list(), columns=['c5' + str(j) for j in range(1, 7)])
dfc6 = pd.DataFrame(df1["c6"].to_list(), columns=['c6' + str(j) for j in range(1, 7)])
before merging them
data_frames = [dfc1, dfc2, dfc3, dfc4, dfc5, dfc6]
df_merged = pd.DataFrame().join(data_frames, how="outer")
which gives me a DataFrame with columns containing the individual cij values:
(screenshot of the resulting DataFrame: https://i.stack.imgur.com/odraQ.png)
I can now carry out arithmetic operations on these individual values and add a column in the initial "df" dataframe with the results, but there must be a better way of doing it (especially if the matrices are large). Any idea?
- the approach uses apply(pd.Series) to expand each nested list into columns
- stack() and unstack() then generate multi-index columns that are zero-indexed positions into the 2D list
- finally, the multi-index is flattened to match your stated requirement (one-indexed instead of zero-indexed)
import json
from pathlib import Path

import pandas as pd

# file downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.h505v
with open(Path.cwd().joinpath("ec.json")) as f:
    js = json.load(f)
df = pd.json_normalize(js)
# expand first dimension, put it into row index, expand second dimension, make multi-index columns
dfet = df["elastic_tensor"].apply(pd.Series).stack().apply(pd.Series).unstack()
# flatten multi-index columns, index from 1, instead of standard 0
dfet.columns = [f"c{i+1}{j+1}" for i,j in dfet.columns.to_flat_index()]
dfet.head(5)
       c11      c12      c13      c14       c15       c16  ...      c65      c66
0  311.335  144.451  126.176  0.00000 -0.110347  0.000000  ...  0.00000  103.339
1  306.934  88.0263  105.696  2.53622 -0.568262 -0.188934  ...  1.92806  105.685
2  569.529  157.852  157.851  0.00000  0.000000  0.000000  ...  0.00000  94.8801
3   69.288  34.7875  37.3877  0.00000  0.000000  0.000000  ...  0.00000  30.4095
4  349.377  186.671  176.476  0.00000  0.000000  0.000000  ...  0.00000  74.9078

[5 rows x 36 columns (c11 through c66); display truncated]
A numpy approach:
import numpy as np

a = np.dstack(df["elastic_tensor"])
pd.DataFrame(a.reshape((a.shape[0] * a.shape[1], a.shape[2])).T,
             columns=[f"c{i+1}{j+1}" for i in range(a.shape[0]) for j in range(a.shape[1])])
I need to process my dataframe in Python such that I sum the numeric values of the numeric columns for the rows that lie between two marker rows of the dataframe.
The dataframe can be created using
df = pd.DataFrame(np.array([['a',0,1,0,0,0,0,'i'],
['b',1,0,0,0,0,0,'j'],
['c',0,0,1,0,0,0,'k'],
['None',0,0,0,1,0,0,'l'],
['e',0,0,0,0,1,0,'m'],
['f',0,1,0,0,0,0,'n'],
['None',0,0,0,1,0,0,'o'],
['h',0,0,0,0,1,0,'p']]),
columns=[0,1,2,3,4,5,6,7],
index=[0,1,2,3,4,5,6,7])
I need to sum all rows that occur before each 'None' entry and move the aggregated rows to a new dataframe.
Your dataframe's dtypes are messed up: because you assigned the values from a single numpy array, and one array can only hold one type, all the ints were pushed to strings. We need to convert them back first:
df = df.apply(pd.to_numeric, errors='ignore')  # convert the numeric columns back to int
df['newkey'] = df[0].eq('None').cumsum()  # create the group key with cumsum
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.iloc[0])  # then aggregate
Out[742]:
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
You can also specify the agg funcs per column:
# sum the digit columns (still strings) and keep the first value of the label columns
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p
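The same grouping can also be written with pandas' built-in aggregators once the frame has been converted with pd.to_numeric, as in the first answer. A sketch under that assumption:
import pandas as pd

# 'first' for the label columns, 'sum' for the numeric ones
aggs = {c: ('first' if c in (0, 7) else 'sum') for c in df.columns}
key = df[0].eq('None').cumsum()  # increments at every 'None' marker row
print(df[df[0].ne('None')].groupby(key).agg(aggs))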
I would like to find the mode value of each digit in binary strings of a pandas column. Suppose I have the following data
df = pd.DataFrame({'categories':['A','B','C'],'values':['001','110','111']})
so my data look like this
categories values
A 001
B 110
C 111
If we consider the column "values" at the first digit (0, 1, 1) of A, B, and C respectively, the mode value is 1. If we do the same for other digits, my expected output should be 111.
I can find the mode value of a particular column, so if I split each bit into a new column and find the mode value, I could get the expected output by concatenation later. However, when the data has many more columns of binary strings, I'm not sure whether this method is still a good way to do it. I'm looking for a more elegant method. May I have your suggestion?
I think you can use apply with Series and list to convert the digits to columns, and then mode:
print (df['values'].apply(lambda x: pd.Series(list(x))))
0 1 2
0 0 0 1
1 1 1 0
2 1 1 1
df1 = df['values'].apply(lambda x: pd.Series(list(x))).mode()
print (df1)
0 1 2
0 1 1 1
Last, select the row, convert it to a list, and join:
print (''.join(df1.iloc[0].tolist()))
111
Another possible solution with list comprehension:
df = pd.DataFrame([list(x) for x in df['values']])
print (df)
0 1 2
0 0 0 1
1 1 1 0
2 1 1 1
If the output of mode is a DataFrame with multiple rows, it is possible to use apply with join:
df = pd.DataFrame({'categories':['A','B','C', 'D'],'values':['001','110','111', '000']})
print (df)
categories values
0 A 001
1 B 110
2 C 111
3 D 000
print (pd.DataFrame([list(x) for x in df['values']]).mode())
0 1 2
0 0 0 0
1 1 1 1
df1 = pd.DataFrame([list(x) for x in df['values']]).mode().apply(''.join, axis=1)
print (df1)
0 000
1 111
dtype: object
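Since the digits are only 0 and 1, the per-position mode is just the majority bit, so it can also be computed numerically without mode at all. A sketch of that alternative (note a 50/50 tie yields 0 here, whereas mode would report both values):
import numpy as np
import pandas as pd

df = pd.DataFrame({'categories': ['A', 'B', 'C'], 'values': ['001', '110', '111']})

# one row per string, one int column per digit; the column mean exceeds 0.5
# exactly when 1 is the majority bit in that position
bits = np.array([list(x) for x in df['values']], dtype=int)
mode_bits = (bits.mean(axis=0) > 0.5).astype(int)
print(''.join(map(str, mode_bits)))  # 111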
I have a Pandas series of 10000 rows populated with single letters, from A to Z.
However, I want to create dummy data frames for only A, B, and C, using Pandas get_dummies.
How do I go about doing that?
I don't want to get dummies for all the row values in the column and then select the specific columns, as the column contains other redundant data which eventually causes a Memory Error.
Try this:
# create mock dataframe
df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})

# use replace with a regex to set characters d-z to None
pd.get_dummies(df.replace({'[^a-c]': None}, regex=True))
output:
alpha_a alpha_b alpha_c
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 0 0 0
6 0 0 0
7 0 0 0
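Another way to avoid materializing dummies for every letter is to declare the column categorical with only the wanted categories first; values outside a-c become NaN and yield all-zero rows. A sketch of this alternative:
import pandas as pd

df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})

# restrict the dtype to the three wanted categories; get_dummies then
# emits exactly one column per category
df['alpha'] = pd.Categorical(df['alpha'], categories=['a', 'b', 'c'])
print(pd.get_dummies(df))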