Positional indexing with NA values - python

I need to index a dataframe by positional index, but the indexer contains NA values from a previous operation and I want to preserve them. How can I achieve this?
df1
0 NaN
1 1
2 NaN
3 NaN
4 NaN
5 6
df2
0 10
1 15
2 13
3 15
4 16
5 17
6 17
7 18
8 10
df3
0 15
1 17
The output I want
0 NaN
1 15
2 NaN
3 NaN
4 NaN
5 17
df2.iloc[df1]
IndexError: indices are out-of-bounds
The .iloc method leads to an out-of-bounds error here, so I think .iloc is not usable in this case. df3 is another output, generated with .loc, but I don't know how to insert the NaN rows between its values. A solution that produces the output from df1 and df3 would also be fine.

If df1 and df2 have the same index values, you can replace the non-missing values with values from the other DataFrame using DataFrame.mask with DataFrame.isna:
df1 = df2.mask(df1.isna())
print(df1)
col
0 NaN
1 15.0
2 NaN
3 NaN
4 NaN
5 17.0
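If the indexes do not align, a minimal sketch (using Series for brevity) that treats df1's non-missing values as positional indices into df2:
import numpy as np
import pandas as pd

df1 = pd.Series([np.nan, 1, np.nan, np.nan, np.nan, 6])
df2 = pd.Series([10, 15, 13, 15, 16, 17, 17, 18, 10])

out = df1.copy()
mask = df1.notna()
# look up df2 by position only where df1 holds an index; keep NaN elsewhere
out[mask] = df2.to_numpy()[df1[mask].astype(int).to_numpy()]
print(out)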

Related

Pandas new column replace only show specific pattern value in new column

Index value
1 880770000-t-ptt-018-108
2 Nan
3 760770000-t-ptm-001-107
4 Date
5 11/20/2020
6 607722991-t-ptr-001-888
7 NaN
8 Date
9 10/25/2020
10 12/30/2019
11 967722944-t-ptq-020-888
In a new column in the same dataframe, I want only the values matching a specific pattern to be shown, and all other values to be replaced by NaN, like this. The original table has 200k rows and 22 columns, and the pattern has over 5000 combinations.
Index value
1 880770000-t-ptt-018-108
2 Nan
3 760770000-t-ptm-001-107
4 NaN
5 Nan
6 607722991-t-ptr-001-888
7 NaN
8 NaN
9 NaN
10 NaN
11 967722944-t-ptq-020-888
df['value'] = df['value'].apply(lambda x: x if isinstance(x, str) and "-t-" in x else np.nan)
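With 200k rows, a vectorized sketch using Series.str.contains may be faster than apply (na=False makes rows that are already NaN fail the test and stay NaN):
df['value'] = df['value'].where(df['value'].str.contains('-t-', na=False))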

How to fill missing numeric values in df column

I am trying to add rows to a data frame that should follow a numeric order from 1 to 52, but my data is missing some numbers, so I need to add those rows and fill the spots with NaN or null values.
df = pd.DataFrame({"Weeks": [1, 2, 3, 15, 16, 20, 21, 52],
                   "Values": [10, 10, 10, 10, 50, 60, 70, 40]})
Desired output:
Weeks Values
1 10
2 10
3 10
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
...
52 40
and so on until it reach Weeks = 52
My solution:
new_df = pd.DataFrame({"Weeks": [], "Values": []})
for x in range(1, 53):
    for i in df.Weeks:
        if x == i:
            new_df["Weeks"] = x
            new_df["Values"] = df.Values[i]
The problem is that this is super inefficient; does anyone know a much more efficient way to do it?
You could use set_index to set Weeks as the index and reindex with a range up to and including the maximum week:
df.set_index('Weeks').reindex(range(1, df.Weeks.max() + 1))
Or accounting for the minimum week too:
df.set_index('Weeks').reindex(range(df.Weeks.min(), df.Weeks.max() + 1))
Values
Weeks
1 10.0
2 10.0
3 10.0
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 10.0
16 50.0
17 NaN
...
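A complete sketch putting the pieces together, with reset_index to restore Weeks as a column:
import pandas as pd

df = pd.DataFrame({"Weeks": [1, 2, 3, 15, 16, 20, 21, 52],
                   "Values": [10, 10, 10, 10, 50, 60, 70, 40]})

# reindex over the full 1..52 range; missing weeks get NaN Values
out = df.set_index('Weeks').reindex(range(1, 53)).reset_index()
print(out)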

Setting the index after merging with pandas?

Executing the following merge
import pandas as pd
s = pd.Series(range(5, 10), index=range(10, 15), name='score')
df = pd.DataFrame({'id': (11, 13), 'value': ('a', 'b')})
pd.merge(s, df, 'left', left_index=True, right_on='id')
results in this data frame:
score id value
NaN 5 10 NaN
0.0 6 11 a
NaN 7 12 NaN
1.0 8 13 b
NaN 9 14 NaN
Why does Pandas take the index from the right data frame as the index for the result, instead of the index from the left series, even though I specified both a left merge and left_index=True? The documentation says
left: use only keys from left frame
which I interpreted differently from the result I am actually getting. What I expected was the following data frame.
score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
I am using Python 3.7.5 with Pandas 0.25.3.
Here's what happens:
the output index comes from the right frame's row index, so only the matched keys retain index values ([0, 1] here)
missing keys get NaN in the index
the NaNs cause the index dtype to be upcast to float
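A quick sketch checking the index directly, with the s and df from the question:
result = pd.merge(s, df, how='left', left_index=True, right_on='id')
print(result.index)
# a float index with NaN for the unmatched keys, e.g. Float64Index([nan, 0.0, nan, 1.0, nan])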
To set the index, just assign to it:
s2 = pd.merge(s, df, how='left', left_index=True, right_on='id')
s2.index = s.index
score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
You can also merge on s (just because I dislike calling pd.merge directly):
(s.to_frame()
   .merge(df, how='left', left_index=True, right_on='id')
   .set_axis(s.index, axis=0, inplace=False))
score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
You can do this with reset_index:
df = (pd.merge(s, df, 'left', left_index=True, right_on='id')
        .reset_index(drop=True)
        .set_index('id')
        .rename_axis(index=None))
df.insert(1, 'id', df.index)
score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
Since I do not need the duplicated information in both the id column and the index, I went with a combination of the answers from cs95 and oppressionslayer, and did the following:
pd.merge(s, df, 'left', left_index=True, right_on='id').set_index('id')
Which results in this data frame:
score value
id
10 5 NaN
11 6 a
12 7 NaN
13 8 b
14 9 NaN
Since this is different from what I initially asked for, I am leaving the answer from cs95 as the accepted answer, but I think this use case needs to be documented as well.

How to split column from DataFrame with Pandas

I am reading a CSV file from an API call into a data frame with pandas for some data manipulation.
Currently, I'm getting this response:
In [78]: dfname
Out[78]:
productID amountInStock index index_col
7 1.0 NaN 1 7
19 4.0 NaN 2 19
20 1.0 NaN 3 20
22 2.0 NaN 4 22
I then call dfname.reset_index() to create a better index:
dfname.reset_index()
Out[80]:
level_0 productID amountInStock index index_col
0 7 1.0 NaN 1 7
1 19 4.0 NaN 2 19
2 20 1.0 NaN 3 20
3 22 2.0 NaN 4 22
But the problem is that the 'productID' series has two columns and I can't work out how to split them!
dfname.productID
Out[82]:
7 1.0
19 4.0
20 1.0
22 2.0
What I want is dfname.productID to return:
dfname.productID
Out[82]:
7
19
20
22
and the other figures currently in productID should be assigned to 'stockqty'.
How do I split this field so that it returns two columns instead of one? I've tried .str.split() to no avail.
The properties of the object are Name: productID, Length: 2102, dtype: float64
But the problem is that the 'productID' series has two columns and I can't work out how to split them!
Therein lies the misunderstanding. You don't have 2 columns, despite what print tells you. You have one column with an index. This is precisely how a pd.Series object is defined.
What I want is dfname.productID to return:
As above, this isn't possible. Every series has an index. This is non-negotiable.
How do I split this field so that it returns two columns instead of one? I've tried .str.split() to no avail.
This isn't the way forward. In particular, note pd.Series.str.split is for splitting strings within series. You don't have strings here. Instead, use reset_index and rename your column. Or name your index before reset_index. The latter option seems cleaner to me:
df.index.name = 'stockqty'
df = df.reset_index()
print(df)
stockqty productID amountInStock index index_col
0 7 1.0 NaN 1 7
1 19 4.0 NaN 2 19
2 20 1.0 NaN 3 20
3 22 2.0 NaN 4 22
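The other route mentioned above (reset_index first, then rename) would be this sketch; because the frame already has a column called 'index', pandas names the new column 'level_0', as seen in the question's output:
df = df.reset_index().rename(columns={'level_0': 'stockqty'})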
I resolved it by specifying the separator when parsing the CSV:
df = pd.read_csv(link, encoding='ISO-8859-1', sep=', ', engine='python')

Best way to eliminate columns with only one value from pandas dataframe

I'm trying to build a function to eliminate from my dataset the columns that contain only one value. I used this function:
def oneCatElimination(dataframe):
    columns = dataframe.columns.values
    for column in columns:
        if len(dataframe[column].value_counts().unique()) == 1:
            del dataframe[column]
    return dataframe
The problem is that the function also eliminates columns with more than one distinct value, e.g. an index column of integers.
Just
df.dropna(thresh=2, axis=1)
will work. No need for anything else. It will keep all columns with 2 or more non-NA values (controlled by the value passed to thresh). The axis kwarg will let you work with rows or columns. It is rows by default, so you need to pass axis=1 explicitly to work on columns (I forgot this at the time I answered, hence this edit). See dropna() for more information.
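A quick sketch of how thresh behaves, on a hypothetical toy frame:
import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': [np.nan, 23, np.nan],  # one non-NA value -> dropped
                    'B': [52, 36, np.nan]})     # two non-NA values -> kept
print(toy.dropna(thresh=2, axis=1))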
A couple of assumptions went into this:
Null/NA values don't count
You need multiple non-NA values to keep a column
Those values need to be different in some way (e.g., a column full of 1's and only 1's should be dropped)
All that said, I would use a select statement on the columns.
If you start with this dataframe:
import pandas
N = 15
df = pandas.DataFrame(index=range(10), columns=list('ABCD'))
df.loc[2, 'A'] = 23
df.loc[3, 'B'] = 52
df.loc[4, 'B'] = 36
df.loc[5, 'C'] = 11
df.loc[6, 'C'] = 11
df.loc[7, 'D'] = 43
df.loc[8, 'D'] = 63
df.loc[9, 'D'] = 97
df
Which creates:
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 23 NaN NaN NaN
3 NaN 52 NaN NaN
4 NaN 36 NaN NaN
5 NaN NaN 11 NaN
6 NaN NaN 11 NaN
7 NaN NaN NaN 43
8 NaN NaN NaN 63
9 NaN NaN NaN 97
Given my assumptions above, columns A and C should be dropped since A only has one value and both of C's values are the same. You can then do:
df.select(lambda c: df[c].dropna().unique().shape[0] > 1, axis=1)
And that gives me:
B D
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 52 NaN
4 36 NaN
5 NaN NaN
6 NaN NaN
7 NaN 43
8 NaN 63
9 NaN 97
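Note that DataFrame.select was deprecated in pandas 0.21 and removed in 1.0; on modern pandas, a sketch of the same selection using loc:
keep = [c for c in df.columns if df[c].dropna().unique().shape[0] > 1]
df.loc[:, keep]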
This will work for both text and numbers:
for col in dataframe:
    if len(dataframe.loc[:, col].unique()) == 1:
        dataframe.pop(col)
Note: This will remove the columns having only one value from the original dataframe.
