How to split column from DataFrame with Pandas - python

I am reading a CSV file from an API call into a data frame with pandas for some data manipulation.
Currently, I'm getting this response:
n [78]: dfname
Out[78]:
productID amountInStock index index_col
7 1.0 NaN 1 7
19 4.0 NaN 2 19
20 1.0 NaN 3 20
22 2.0 NaN 4 22
I then call dfname.reset_index() to create a better index:
dfname.reset_index()
Out[80]:
level_0 productID amountInStock index index_col
0 7 1.0 NaN 1 7
1 19 4.0 NaN 2 19
2 20 1.0 NaN 3 20
3 22 2.0 NaN 4 22
But the problem is that the 'productID' series has two columns and I can't work out how to split them!
dfname.productID
Out[82]:
7 1.0
19 4.0
20 1.0
22 2.0
What I want is dfname.productID to return:
dfname.productID
Out[82]:
7
19
20
22
and the other figures currently in productID should be assigned to 'stockqty'.
How do I split this field so that it returns two columns instead of one? I've tried .str.split() to no avail.
The properties of the object are Name: productID, Length: 2102, dtype: float64

But the problem is that the 'productID' series has two columns and I
can't work out how to split them!
Therein lies the misunderstanding. You don't have 2 columns, despite what print tells you. You have one column with an index. This is precisely how a pd.Series object is defined.
What I want is dfname.productID to return:
As above, this isn't possible. Every series has an index. This is non-negotiable.
How do I split this field so that it returns two columns instead of
one? I've tried .str.split() to no avail.
This isn't the way forward. In particular, note pd.Series.str.split is for splitting strings within series. You don't have strings here. Instead, use reset_index and rename your column. Or name your index before reset_index. The latter option seems cleaner to me:
df.index.name = 'stockqty'
df = df.reset_index()
print(df)
stockqty productID amountInStock index index_col
0 7 1.0 NaN 1 7
1 19 4.0 NaN 2 19
2 20 1.0 NaN 3 20
3 22 2.0 NaN 4 22

I resolved by specifying the separator when parsing the csv:
df = pd.read_csv(link, encoding='ISO-8859-1', sep=', ', engine='python')

Related

Positional indexing with NA values

I need to index the dataframe from positional index, but I got NA values in previous operation and I wanna preserve it. How could I achieve this?
df1
NaN
1
NaN
NaN
NaN
6
df2
0 10
1 15
2 13
3 15
4 16
5 17
6 17
7 18
8 10
df3
0 15
1 17
The output I want
NaN
15
NaN
NaN
NaN
17
df2.iloc(df1)
IndexError: indices are out-of-bounds
.iloc method in this case drive to a unbound error, I think .iloc is not available here. df3 is another output generated by .loc, but I don't know how to add NaN between them. If you can achieve output by using df1 and df3 is also ok
If df1 and df2 has same index values use for replace non missing values by values from another DataFrame DataFrame.mask with DataFrame.isna:
df1 = df2.mask(df1.isna())
print (df1)
col
0 NaN
1 15.0
2 NaN
3 NaN
4 NaN
5 17.0

Check if values in one dataframe match values from another, updating dataframe

Let's say I have 2 dataframes,
both have different lengths but the same amount of columns
df1 = pd.DataFrame({'country': ['Russia','Mexico','USA','Argentina','Denmark','Syngapore'],
'population': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'country': ['Russia','Argentina','Australia','USA'],
'population': [44,12,23,64]})
Lets assume that some of the data in df1 is outdated and I've received a new dataframe that contains some new data but not which may or may not exist already in the outdated dataframe.
I want to find out if any of the values of df2.country are inside df1.country
By doing the following I'm able to return a boolean:
df = df1.country.isin(df2.country)
print(df)
Unfortunately I'm just creating a new dataframe containing the answer to my question
0 True
1 False
2 True
3 True
4 False
5 False
Name: country, dtype: bool
My goal here is to delete the rows of df1 which values match with df2 and add the new data, kind of like an update.
I've manage to come up with something like this:
df = df1.country.isin(df2.country)
i = 0
for x in df:
if x:
df1.drop(i, inplace=True)
i += 1
frames = [df1, df2]
df1 = pd.concat(frames)
df1.reset_index(drop=True, inplace=True)
print(df1)
which in fact works and updates the dataframe
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
But I really believe there's a batter way of doing the same thing quicker and much more practical considering that the real dataframe is much bigger and updates every few seconds.
I'd love to hear some suggestions, Thanks!
Assuming col1 remains unique in the original dataframe, you can join the two tables together. Once you have them in the same dataframe, you can apply your logic i.e. update value from new dataframe if it is not null. You actually don't need to check if col2 has changed for every entry in col1. You can just replace col2 value with col1 as long as it is not NaN (based on your sample output).
df1 = pd.DataFrame({'col1': ['a','f','r','g','d','s'], 'col2': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'col1': ['a','g','o','r'], 'col2': [44,12,23,64]})
# do the join
x= pd.merge(df1,df2,how='outer',
left_on="col1", right_on="col1")
col1 col2_x col2_y
0 a 41.0 44.0
1 f 12.0 NaN
2 r 26.0 64.0
3 g 64.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o NaN 23.0
# apply your update rules
x['col2_x'] = np.where(
~x['col2_y'].isnull(),
x['col2_y'],x['col2_x']
)
col1 col2_x col2_y
0 a 44.0 44.0
1 f 12.0 NaN
2 r 64.0 64.0
3 g 12.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o 23.0 23.0
#clean up
x.drop("col2_y", axis=1, inplace = True)
x.columns = ["col1", "col2"]
col1 col2
0 a 44.0
1 f 12.0
2 r 64.0
3 g 12.0
4 d 123.0
5 s 24.0
6 o 23.0
The isin approach is so close! Simply use the results from isin as a mask, then concat the rows from df1 that are not in (~) df2 with the rest of df2:
m = df1['country'].isin(df2['country'])
df3 = pd.concat((df1[~m], df2), ignore_index=True)
df3:
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64

Efficiently combine dataframes on 2nd level index

I have two dataframes looking like
import pandas as pd
df1 = pd.DataFrame([2.1,4.2,6.3,8.4,10.5], index=[2,4,6,8,10])
df1.index.name = 't'
df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples([('A','a',1),('A','a',4),
('A','b',5),('A','b',6),('B','c',7),
('B','c',9),('B','d',10),('B','d',11),
], names=('big', 'small', 't')))
I am searching for an efficient way to combine them such that I get
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
I.e. I want to get the index levels 0 and 1 of df2 as index levels 0 and 1 in df1.
Of course a loop over the dataframe would work as well, though not feasible for large dataframes.
EDIT:
It appears from comments below that I should add, the indices big and small should be inferred on t in df1 based on the ordering of t.
Assuming that you want the unknown index levels to be inferred based on the ordering of 't', we can use an other merge, sort the values and then re-create the MultiIndex using ffill logic (need a Series for this).
res = (df2.reset_index()
.merge(df1, on='t', how='outer')
.set_index(df2.index.names)
.sort_index(level='t'))
res.index = pd.MultiIndex.from_arrays(
[pd.Series(res.index.get_level_values(i)).ffill()
for i in range(res.index.nlevels)],
names=res.index.names)
print(res)
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
Try extracting the level values and reindex:
df2['0'] = df1.reindex(df2.index.get_level_values('t'))[0].values
Output:
0
big small t
A a 1 NaN
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
9 NaN
d 10 10.5
11 NaN
For more columns in df1, we can just merge:
(df2.reset_index()
.merge(df1, on='t', how='left')
.set_index(df2.index.names)
)

How to really filter a pandas dataset without leaving Nans everywhere

Say I have a huge DataFrame that only contains a handful of cells that match the filtering I perform. How can I end up with only the values that match it (and their indexes and columns) in a new dataframe without the entire other DataFrame that becomes Nan. Dropping Nan's with dropna just removes the whole column or row and filter replaces non matches with Nans.
Here's my code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((1000, 1000)))
# this one is almost filled with Nans
df[df<0.01]
If need non missing values in another format you can use DataFrame.stack:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(5, 3)))
# this one is almost filled with Nans
df1 = df[df<7]
print (df1)
0 1 2
0 0.0 NaN 3.0
1 6.0 3.0 3.0
2 NaN NaN 0.0
3 0.0 NaN NaN
4 3.0 NaN 2.0
df2 = df1.stack().rename_axis(('a','b')).reset_index(name='c')
print (df2)
a b c
0 0 0 0.0
1 0 2 3.0
2 1 0 6.0
3 1 1 3.0
4 1 2 3.0
5 2 2 0.0
6 3 0 0.0
7 4 0 3.0
8 4 2 2.0

drop_duplicates() stopped working in Python pandas

This code had previously worked in python 3 to remove the duplicate values but keep first occurrence across an entire dataframe. After coming back to my script this no longer removes duplicates in a pandas dataFrame.
df = df.apply(lambda x: x.drop_duplicates(), axis=1)
so if I have
a b c
0 1 2
3 4 0
0 8 9
10 0 11
I want to get as an output
a b c
0 1 2
3 4
8 9
10 11
I don't mind if the blanks return as 'nan'
I also tried the following
df.drop_duplicates(subset = None, keep='first')
and
df.drop_duplicates(subset = None, keep='first', inplace =True)
Any advice / alternatives would be welcome!
After your attached the data , I think you can using duplicated
newdf=df[~df.stack().duplicated().unstack()]
newdf
Out[131]:
a b c
0 0.0 1.0 2.0
1 3.0 4.0 NaN
2 NaN 8.0 9.0
3 10.0 NaN 11.0
You need inplace to be True:
df.drop_duplicates(subset=None, keep='first', inplace=True)

Categories