Columns: [TERMID, NAME, TYP, NAMECHANGE, ALIASES, SUCHREGELN, ZUW, SEOTEXT1, SEOTEXT2, SEOKommentar, DBIKommentar]
This is my empty dataframe. And this is the dataframe with the values I need to fill it from:
one two three four
0 1.0 4.0 2.4 6.4
1 2.0 3.0 4.4 4.1
2 3.0 2.0 7.0 1.0
3 4.0 1.0 9.0 5.0
I need to fill these values into my empty dataframe.
So let's say 'TERMID' takes the value from 'one', 'TYP' the value from 'two', 'ZUW' the value from 'three', and last but not least 'SEOKommentar' takes the value from 'four'.
The empty dataframe needs to get filled row by row, and the columns which are not filled should contain NaN.
How can I do this in an accurate way?
IIUC, you can rename the second dataframe's columns and then reindex to the columns of the original empty dataframe:
Creating the empty data frame:
s = 'TERMID,NAME,TYP,NAMECHANGE,ALIASES,SUCHREGELN,ZUW,SEOTEXT1,SEOTEXT2,SEOKommentar,DBIKommentar'
df = pd.DataFrame(columns=s.split(','))
Empty DataFrame
Columns: [TERMID, NAME, TYP, NAMECHANGE, ALIASES, SUCHREGELN, ZUW, SEOTEXT1, SEOTEXT2, SEOKommentar, DBIKommentar]
Index: []
Solution (df1 is the second dataframe in your example):
d = {'one': 'TERMID', 'two': 'TYP', 'three': 'ZUW', 'four': 'SEOKommentar'}
df = df1.rename(columns=d).reindex(columns=df.columns)
TERMID NAME TYP NAMECHANGE ALIASES SUCHREGELN ZUW SEOTEXT1 \
0 1.0 NaN 4.0 NaN NaN NaN 2.4 NaN
1 2.0 NaN 3.0 NaN NaN NaN 4.4 NaN
2 3.0 NaN 2.0 NaN NaN NaN 7.0 NaN
3 4.0 NaN 1.0 NaN NaN NaN 9.0 NaN
SEOTEXT2 SEOKommentar DBIKommentar
0 NaN 6.4 NaN
1 NaN 4.1 NaN
2 NaN 1.0 NaN
3 NaN 5.0 NaN
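For completeness, an alternative sketch (not from the answer; it assumes the same df, df1 and mapping dict d as above): build only the mapped columns with a dict comprehension, then reindex to the full column list so the remaining columns come out as NaN.
# d = {'one': 'TERMID', 'two': 'TYP', 'three': 'ZUW', 'four': 'SEOKommentar'} as in the answer
out = pd.DataFrame({new: df1[old] for old, new in d.items()}).reindex(columns=df.columns)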
Related
I'm trying to pivot a dataframe in pandas to produce heatmaps (pandas version 1.4.3). The issue is that after pivoting, the original sorting of the index column is lost. Since my data represent samples from geographical locations, I need them to be sorted by latitude (which is how they are in the 'TILE' column in the example below).
mwe:
dummy = [{'TILE':'N59TE010A','METRIC':'ELD_RMSE','LOW':2},
{'TILE':'N59TE009G','METRIC':'ELD_RMSE','LOW':2},
{'TILE':'N59RE009G','METRIC':'ELD_RMSE','LOW':1},
{'TILE':'N59TE010B','METRIC':'ELD_RMSE','LOW':2},
{'TILE':'N59TE010C','METRIC':'ELD_RMSE','LOW':2},
{'TILE':'S24TW047F','METRIC':'RUF_RMSE','LOW':2},
{'TILE':'S24TW047G','METRIC':'ELD_LE90','LOW':2},
{'TILE':'S24MW047D','METRIC':'SMD_LE90','LOW':2},
{'TILE':'S24MW047C','METRIC':'RUF_RMSE','LOW':0},
{'TILE':'S24MW047D','METRIC':'RUF_RMSE','LOW':0}]
df = pd.DataFrame.from_dict(dummy)
df
TILE METRIC LOW
0 N59TE010A ELD_RMSE 2
1 N59TE009G ELD_RMSE 2
2 N59RE009G ELD_RMSE 1
3 N59TE010B ELD_RMSE 2
4 N59TE010C ELD_RMSE 2
5 S24TW047F RUF_RMSE 2
6 S24TW047G ELD_LE90 2
7 S24MW047D SMD_LE90 2
8 S24MW047C RUF_RMSE 0
9 S24MW047D RUF_RMSE 0
df.pivot(index='TILE', columns='METRIC', values='LOW')
METRIC ELD_LE90 ELD_RMSE RUF_RMSE SMD_LE90
TILE
N59RE009G NaN 1.0 NaN NaN
N59TE009G NaN 2.0 NaN NaN
N59TE010A NaN 2.0 NaN NaN
N59TE010B NaN 2.0 NaN NaN
N59TE010C NaN 2.0 NaN NaN
S24MW047C NaN NaN 0.0 NaN
S24MW047D NaN NaN 0.0 2.0
S24TW047F NaN NaN 2.0 NaN
S24TW047G 2.0 NaN NaN NaN
Never mind the NaN values, the point is that the first row should have tile N59TE010A, and not N59RE009G (and so on).
I've been trying a few solutions I found here and elsewhere but without luck. Is there a way to preserve the sorting of the 'TILE' column?
Thanks
You can use pivot_table, which has more options, including sort=False:
df.pivot_table(index='TILE', columns='METRIC', values='LOW', sort=False)
Another option is to add a dummy column to use as the index with the desired order, using for example pandas.factorize to keep the original order:
(df.assign(idx=pd.factorize(df['TILE'])[0])
   .pivot(index=['idx', 'TILE'], columns='METRIC', values='LOW')
   .droplevel('idx')
)
output:
METRIC ELD_LE90 ELD_RMSE RUF_RMSE SMD_LE90
TILE
N59TE010A NaN 2.0 NaN NaN
N59TE009G NaN 2.0 NaN NaN
N59RE009G NaN 1.0 NaN NaN
N59TE010B NaN 2.0 NaN NaN
N59TE010C NaN 2.0 NaN NaN
S24TW047F NaN NaN 2.0 NaN
S24TW047G 2.0 NaN NaN NaN
S24MW047D NaN NaN 0.0 2.0
S24MW047C NaN NaN 0.0 NaN
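A further possibility (a sketch, not part of the answer): turn 'TILE' into an ordered categorical whose category order is the order of first appearance, so that the sorting pivot performs follows that order rather than the alphabetical one.
# make TILE an ordered categorical; unique() preserves the order of first appearance
df['TILE'] = pd.Categorical(df['TILE'], categories=df['TILE'].unique(), ordered=True)
# pivot now sorts the index by category order, i.e. the original order
out = df.pivot(index='TILE', columns='METRIC', values='LOW')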
I need to populate NaN values of my df by a static 0, starting from the first non-nan value.
In a way, combining method="ffill" (identify the first value per column, and only act on the NaN values that follow it) with value=0 (populating with 0, not with the forward-filled values from df).
How can I do that? This post is close, but not it: How to replace NaNs by preceding or next values in pandas DataFrame?
Example df
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 NaN 3.0 NaN
3 NaN NaN 4.0
Desired output:
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 0.0 3.0 0.0
3 0.0 0.0 4.0
If possible, df.fillna(value=0, method='ffill') would be great. But that returns ValueError: Cannot specify both 'value' and 'method'.
Edit: Oh, and time matters. We are talking ~60M rows and 4k columns - so looping is out of the question, and masking only if it is really, really fast.
You can try mask(), ffill() and fillna():
df=df.fillna(df.mask(df.ffill().notna(),0))
#OR via where
df=df.fillna(df.where(df.ffill().isna(),0))
output:
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 0.0 3.0 0.0
3 0.0 0.0 4.0
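An equivalent sketch (not from the answer, same idea written with where): build a boolean mask that is True from the first non-NaN value in each column downward, then fill with 0 only inside that region.
mask = df.ffill().notna()            # True at and below the first non-NaN value in each column
df = df.where(~mask, df.fillna(0))   # keep the leading NaNs, replace the later NaNs with 0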
I'm preparing data for machine learning where data is in pandas DataFrame which looks like this:
Column v1 v2
first 1 2
second 3 4
third 5 6
Now I want to transform it into:
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
first 1 2 1 2 NaN NaN NaN NaN
second 3 4 NaN NaN 3 4 NaN NaN
third 5 6 NaN NaN NaN NaN 5 6
What I've tried is to do something like this:
# we know how many values there are but
# length can be changed into length of [1, 2, 3, ...] values
values = ['v1', 'v2']
# data with description from above is saved in data
for value in values:
    data[str(data['Column'] + '-' + value)] = data[value]
The result is columns with names like:
['first-v1' 'second-v1'..], ['first-v2' 'second-v2'..]
which contain the correct values. What am I doing wrong? Is there a more optimal way to do this, given that my data is big?
Thank you for your time!
You can use unstack with swapping and sorting the MultiIndex in columns:
df = (data.set_index('Column', append=True)[values]
          .unstack()
          .swaplevel(0, 1, axis=1)
          .sort_index(axis=1))
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Or stack + unstack:
df = data.set_index('Column', append=True).stack().unstack([1,2])
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Finally, join back to the original:
df = data.join(df)
print (df)
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 \
0 first 1 2 1.0 2.0 NaN NaN NaN
1 second 3 4 NaN NaN 3.0 4.0 NaN
2 third 5 6 NaN NaN NaN NaN 5.0
third-v2
0 NaN
1 NaN
2 6.0
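A related sketch (assuming a reasonably recent pandas; this is not part of the original answer): pivot directly on 'Column' with a list of value columns, then flatten the resulting MultiIndex before joining.
wide = data.pivot(columns='Column', values=['v1', 'v2'])
# columns come back as a MultiIndex of (value, Column); flatten to 'first-v1', 'second-v1', ...
wide.columns = [f'{col}-{val}' for val, col in wide.columns]
out = data.join(wide.sort_index(axis=1))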
I am not looking for merging/concatenating columns or replacing some values with other values (although... maybe yes?). But I have a large dataframe (>100 rows and columns) and I would like to extract columns that are "almost identical", i.e. that have more than two values in common (at the same index) and no different values at other indexes (if there is a value in one column, there must be either the same value or a NaN in the other).
Here is an example of such a dataframe:
a = np.random.randint(1,10,10)
b = np.array([np.nan,2,np.nan,3,np.nan,6,8,1,2,np.nan])
c = np.random.randint(1,10,10)
d = np.array([7,2,np.nan,np.nan,np.nan,6,8,np.nan,2,2])
e = np.array([np.nan,2,np.nan,np.nan,np.nan,6,8,np.nan,np.nan,2])
f = np.array([np.nan,2,np.nan,3.0,7,np.nan,8,np.nan,np.nan,2])
df = pd.DataFrame({'A':a,'B':b,'C':c,'D':d,'E':e, 'F':f})
df.loc[3:6, 'A'] = np.nan
df.loc[4:8, 'C'] = np.nan
EDIT
keys=['S01_o4584','S02_o2531','S03_o7812','S03_o1122','S04_o5210','S04_o3212','S05_o4665','S06_o7425','S07_o3689','S08_o2371']
df['index']=keys
df = df.set_index('index')
A B C D E F
index
S01_o4584 8.0 NaN 9.0 7.0 NaN NaN
S02_o2531 8.0 2.0 5.0 2.0 2.0 2.0
S03_o7812 1.0 NaN 5.0 NaN NaN NaN
S03_o1122 NaN 3.0 6.0 NaN NaN 3.0
S04_o5210 NaN NaN NaN NaN NaN 7.0
S04_o3212 NaN 6.0 NaN 6.0 6.0 NaN
S05_o4665 NaN 8.0 NaN 8.0 8.0 8.0
S06_o7425 1.0 1.0 NaN NaN NaN NaN
S07_o3689 8.0 2.0 NaN 2.0 NaN NaN
S08_o2371 3.0 NaN 9.0 2.0 2.0 2.0
As you see, columns B, D (and newly E) have identical values at locations (indexes) S02_o2531, S04_o3212, S05_o4665 and S08_o2371, whereas at other locations, one has a value while the other has a NaN.
My desired output would be:
index BD*E*
S01_o4584 7
S02_o2531 2
S03_o7812 NaN
S03_o1122 3
S04_o5210 NaN
S04_o3212 6
S05_o4665 8
S06_o7425 1
S07_o3689 2
S08_o2371 2
However, I can't combine columns that would then have two different values for the same beginning of the index: as you can see, column F also shares some of the indexes, but it has a new value at S04_o5210, while the previously combined columns already have a value at "S04_" (index S04_o3212).
Is there a reasonably pythonic way to do it? I.e. 1) find the columns based on the condition that the values in them must be either identical or np.nan, not different; 2) set a condition that a column cannot be combined if it has the same beginning of the index as previously included values (I may probably need to split the string into two columns and use a MultiIndex?); 3) combine them into the new Series/DataFrame.
def almost(df):
    i, j = np.triu_indices(len(df.columns), 1)
    v = df.values
    d = v[:, i] - v[:, j]
    m = (np.where(np.isnan(d), 0, d) == 0).all(0)
    return pd.concat(
        [
            df.iloc[:, i_].combine_first(
                df.iloc[:, j_]
            ).rename(
                tuple(df.columns[[i_, j_]])
            ) for i_, j_ in zip(i[m], j[m])
        ],
        axis=1
    )
almost(df)
B
D
0 7.0
1 2.0
2 NaN
3 3.0
4 NaN
5 6.0
6 8.0
7 1.0
8 2.0
9 2.0
how it works
i and j represent every pair of columns, using numpy's np.triu_indices to get the indices of an upper triangle (see the small illustration after this list).
slice the underlying numpy array df.values with i and j and subtract them. Where the differences are NaN, one value or the other was NaN. Otherwise, the difference should be zero if the respective elements are the same.
since we can tolerate NaN in one or the other, fill those differences with zero using np.where.
find the column pairs for which all rows are zero with (np.where(...) == 0).all(0).
use the mask above to slice i and j and identify the columns that were matches.
build a dataframe of all matches with a pd.MultiIndex for columns that show what matches what.
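A tiny illustration of the index trick (not part of the answer): for three columns, np.triu_indices(3, 1) yields exactly the pairs (0, 1), (0, 2) and (1, 2).
import numpy as np

# upper-triangle indices above the diagonal of a 3x3 matrix,
# i.e. all unordered pairs of 3 columns
i, j = np.triu_indices(3, 1)
print(i)  # [0 0 1]
print(j)  # [1 2 2]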
cooler example
np.random.seed([3, 1415])
m, n = 20, 26
df = pd.DataFrame(
    np.random.randint(10, size=(m, n)),
    columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
).mask(np.random.choice([True, False], (m, n), p=(.6, .4)))
df
almost(df)
A D G H I J K
J X K M N J K V S X
0 6.0 7.0 3.0 NaN 4.0 6.0 NaN 6.0 NaN 7.0
1 3.0 3.0 2.0 6.0 4.0 NaN 2.0 6.0 2.0 2.0
2 3.0 0.0 NaN 2.0 4.0 3.0 NaN 3.0 4.0 0.0
3 4.0 4.0 3.0 5.0 5.0 4.0 3.0 4.0 3.0 3.0
4 7.0 NaN NaN 7.0 3.0 7.0 NaN 7.0 NaN NaN
5 NaN NaN 2.0 0.0 5.0 NaN 2.0 2.0 2.0 2.0
6 NaN 8.0 NaN NaN 9.0 2.0 2.0 1.0 NaN 8.0
7 NaN 7.0 NaN 9.0 9.0 6.0 6.0 NaN NaN 7.0
8 NaN NaN 8.0 3.0 1.0 NaN NaN NaN 4.0 NaN
9 0.0 0.0 8.0 2.0 NaN 3.0 3.0 NaN NaN NaN
10 0.0 0.0 NaN 6.0 1.0 NaN NaN 8.0 NaN NaN
11 NaN NaN 3.0 NaN 9.0 3.0 3.0 NaN 3.0 3.0
12 5.0 NaN NaN NaN 6.0 5.0 NaN 5.0 8.0 NaN
13 NaN NaN NaN NaN 7.0 5.0 5.0 NaN NaN NaN
14 NaN NaN 6.0 4.0 8.0 8.0 8.0 NaN 0.0 NaN
15 8.0 8.0 7.0 NaN NaN NaN NaN NaN 2.0 NaN
16 4.0 4.0 4.0 4.0 9.0 9.0 9.0 6.0 4.0 NaN
17 NaN 4.0 NaN 4.0 2.0 8.0 8.0 4.0 NaN 4.0
18 NaN NaN 2.0 7.0 NaN NaN NaN NaN NaN NaN
19 NaN 7.0 6.0 3.0 5.0 NaN NaN 7.0 NaN 7.0
It sounds like the sticking point is how to detect "almost identical" columns, which are columns that only differ (if at all) in what values are missing. Given two column names, how do you check if they are almost identical? Note that if we find a difference that counts, it must be at an index for which neither column has NaN. In other words, the trick is to discard rows with a missing value and compare the rest:
tocheck = df[["B", "D"]].dropna()
if all(tocheck.B == tocheck.D):
    print("B, D are almost identical")
Let's use this to iterate over all pairs of columns, and merge the ones that match:
import itertools

for a, b in itertools.combinations(df.columns, 2):
    if a not in df.columns or b not in df.columns:  # Was one deleted already?
        continue
    tocheck = df[[a, b]].dropna()
    if all(tocheck[a] == tocheck[b]):
        print(b, "->", a)
        df[a] = df[a].combine_first(df[b])
        del df[b]
Note (in case you haven't noticed) that when multiple columns end up being merged, it's possible to have order-dependent behavior. For example:
A B C
0 NaN 1 2
1 10 NaN NaN
Here you could either merge B or C into A, but not both. Such problems aside, multiple columns can be merged into one since the merged column is saved in place of one of the compared columns.
et voila
test = df.B == df.D
df.loc[test, 'myunion'] = df.loc[test, 'B']
df.loc[~test, 'myunion'] = df.loc[~test, 'B'].fillna(0) + df.loc[~test, 'D'].fillna(0)
I am trying to create a very large dataframe, made up of one column from many smaller dataframes (renamed to the dataframe name). I am using pd.concat() and looping through dictionary values which represent dataframes, and looping over index values, to create the large dataframe. The pd.concat() join_axes is the common index to all the dataframes. This works fine, however I then have duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation, so removing this step isn't an option.
For example, this results in the following final dataframe with duplicate columns:
Is there any way I can use pd.concat() exactly as I am, but merge the columns to produce an output like so?
I think you need:
df = pd.concat([df1, df2])
Or, if you have duplicate columns, use groupby, where overlapping values are summed:
print (df.groupby(level=0, axis=1).sum())
Sample:
df1 = pd.DataFrame({'A':[5,8,7, np.nan],
                    'B':[1,np.nan,np.nan,9],
                    'C':[7,3,np.nan,0]})
df2 = pd.DataFrame({'A':[np.nan,np.nan,np.nan,2],
                    'B':[1,2,np.nan,np.nan],
                    'C':[np.nan,6,np.nan,3]})
print (df1)
A B C
0 5.0 1.0 7.0
1 8.0 NaN 3.0
2 7.0 NaN NaN
3 NaN 9.0 0.0
print (df2)
A B C
0 NaN 1.0 NaN
1 NaN 2.0 6.0
2 NaN NaN NaN
3 2.0 NaN 3.0
df = pd.concat([df1, df2],axis=1)
print (df)
A B C A B C
0 5.0 1.0 7.0 NaN 1.0 NaN
1 8.0 NaN 3.0 NaN 2.0 6.0
2 7.0 NaN NaN NaN NaN NaN
3 NaN 9.0 0.0 2.0 NaN 3.0
print (df.groupby(level=0, axis=1).sum())
A B C
0 5.0 2.0 7.0
1 8.0 2.0 9.0
2 7.0 NaN NaN
3 2.0 9.0 3.0
What you want is df1.combine_first(df2); refer to the pandas documentation.
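To make the contrast concrete, a minimal sketch using the sample df1 and df2 above: combine_first keeps df1's values and only fills its NaNs from df2, whereas the groupby-sum approach adds overlapping values together.
out = df1.combine_first(df2)
print(out)
#      A    B    C
# 0  5.0  1.0  7.0
# 1  8.0  2.0  3.0
# 2  7.0  NaN  NaN
# 3  2.0  9.0  0.0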