I'm trying to fill two pandas columns with lists of different sizes.
For example, the first column holds a list like ['angioplasty', 'aortic', 'artery'] and the second one a list like [251, 2882, 401, 4019, 412].
First I tried to append each list like this:
matches.code_matches.append(code_series)
which resulted in this TypeError:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
So I tried converting the lists into Series and appending them to the dataframe with this code:
code_series = pd.Series(v[0] for v in code_matches)
matches.code_matches.append(code_series)
However, after appending the Series, my DataFrame still ends up empty.
What is the best way to fill the columns of a DataFrame with different-sized lists? Note that I want to keep the lists and fill them into single fields instead of filling in each element separately (I want to assign the values to ids).
Use:
In [704]: l1 = ['angioplasty', 'aortic', 'artery']
In [705]: l2 = [251, 2882, 401, 4019, 412]
In [707]: df = pd.DataFrame({'a': pd.Series(l1), 'b': pd.Series(l2)})
In [708]: df
Out[708]:
a b
0 angioplasty 251
1 aortic 2882
2 artery 401
3 NaN 4019
4 NaN 412
I used the .at method in the code below, and it seems to work.
import numpy as np
import pandas as pd

x = pd.DataFrame()
x['A'] = 1, 2, 3, 4, 5, 6
x['B'] = 1.0, 2, 3, 4, 5, 6
x['C'] = 9, 8, 7, 6, 5, 4
x['D'] = 0
x['D'] = x['D'].astype('object')  # object dtype, so each cell can hold a list
for i in range(len(x)):
    x.at[i, 'D'] = list(np.random.rand(3))
You can replace the list(np.random.rand(3)) with your own data.
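Applied to the question's data, a minimal sketch (the matches frame and its columns here are hypothetical stand-ins for the OP's setup):
import pandas as pd

term_matches = ['angioplasty', 'aortic', 'artery']
code_matches = [251, 2882, 401, 4019, 412]

matches = pd.DataFrame({'id': [0]})
matches['term_matches'] = None  # object dtype, so a single cell can hold a whole list
matches['code_matches'] = None
matches.at[0, 'term_matches'] = term_matches
matches.at[0, 'code_matches'] = code_matches
print(matches)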
I'm trying to use pandas to append a blank row based on the values in the first column. When the first six characters in the first column don't match, I want an empty row between them (effectively creating groups). Here is an example of what the output could look like:
002446

002447-01
002447-02

002448
This is what I was able to put together thus far.
readie = pd.read_csv('title.csv')
i = 0
for row in readie:
    readie.append(row)
    i += 1
    if readie['column title'][i][0:5] != readie['column title'][i+1][0:5]:
        readie.append([])
When running this code, I get the following error message:
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
I believe there are other ways to do this, but I would like to use pandas if at all possible.
I'm using the approach from this answer.
Assuming that strings like '123456' and '123' are considered as not matching:
import pandas as pd

df_orig = pd.DataFrame(
    {'col': ['002446', '002447-01', '002447-02', '002448', '00244', '002448']}
)
df = df_orig.reset_index(drop=True)  # reset your index
first_6 = df['col'].str.slice(stop=6)
# True wherever the 6-character prefix differs from the previous row's
mask = first_6 != first_6.shift(fill_value=first_6[0])
# shift each row's index down by the number of group breaks before it
df.index = df.index + mask.cumsum()
# reindexing over the full range inserts NaN rows at the gaps
df = df.reindex(range(df.index[-1] + 1))
print(df)
col
0 002446
1 NaN
2 002447-01
3 002447-02
4 NaN
5 002448
6 NaN
7 00244
8 NaN
9 002448
I am trying to convert a list of 2d dataframes into one large dataframe. Let's assume I have the following example, where I create a set of dataframes, each one having the same columns / index:
import pandas as pd
import numpy as np
frames = []
names = []
frame_columns = ['DataPoint1', 'DataPoint2']
for i in range(5):
    names.append("DataSet{0}".format(i))
    frames.append(pd.DataFrame(np.random.randn(3, 2), columns=frame_columns))
I would like to convert this set of dataframes into one dataframe df which I can access using df['DataSet0']['DataPoint1'].
This dataframe would have to have a multi-index consisting of the product of the dataset names ['DataSet0', ...] and the index of the individual dataframes (which is of course the same for all individual frames), with ['DataPoint1', 'DataPoint2'] as the columns.
Conversely, the columns could be given as the product of ['DataSet0', ...] and ['DataPoint1', 'DataPoint2'].
In either case, I can create a corresponding MultiIndex and derive an (empty) dataframe based on that:
mux = pd.MultiIndex.from_product([names, frames[0].columns])
frame = pd.DataFrame(index=mux).T
However, I would like to have the contents of the dataframes present rather than having to then add them.
Note that a similar question has been asked here. However, the answers seem to revolve around the Panel class, which is, as of now, deprecated.
Similarly, this thread suggests a join, which is not really what I need.
You can use concat with keys:
total_frame = pd.concat(frames, keys=names)
Output:
DataPoint1 DataPoint2
DataSet0 0 -0.656758 1.776027
1 -0.940759 1.355495
2 0.173670 0.274525
DataSet1 0 -0.744456 -1.057482
1 0.186901 0.806281
2 0.148567 -1.065477
DataSet2 0 -0.980312 -0.487479
1 2.117227 -0.511628
2 0.093718 -0.514379
DataSet3 0 0.046963 -0.563041
1 -0.663800 -1.130751
2 -1.446891 0.879479
DataSet4 0 1.586213 1.552048
1 0.196841 1.933362
2 -0.545256 0.387289
Then you can extract DataSet0 with:
total_frame.loc['DataSet0']
If you really want to use MultiIndex columns instead, you can add axis=1 to concat:
total_frame = pd.concat(frames, axis=1, keys=names)
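The access pattern from the question should then work directly on the columns, e.g.:
total_frame['DataSet0']['DataPoint1']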
I want to apply the .nunique() function to a full DataFrame.
On the following screenshot, we can see that it contains 130 features. [Screenshot of the shape and columns of the dataframe]
The goal is to get the number of different values per feature.
I use the following code (which worked on another DataFrame).
import pandas as pd

def nbDifferentValues(data):
    total = data.nunique()
    total = total.sort_values(ascending=False)
    percent = (total / data.shape[0] * 100)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Pourcentage'])
diffValues = nbDifferentValues(dataFrame)
The code fails at the first line, and I get the following error, which I don't know how to solve: ("unhashable type : 'list'", 'occurred at index columns').
[Trace of the error]
You probably have a column whose content is lists.
Since lists in Python are mutable, they are unhashable.
import pandas as pd
df = pd.DataFrame([
    (0, [1, 2]),
    (1, [2, 3])
])
# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don't use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
    (0, (1, 2)),
    (1, (2, 3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
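If your DataFrame already holds lists (as in the first snippet above), a minimal sketch of an in-place fix is converting them to hashable tuples first:
# column 1 holds the unhashable lists in the example above
df[1] = df[1].apply(tuple)
df.nunique()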
To get nunique or unique in a pandas.Series, my preferred approaches are:
Quick Approach
NOTE: This doesn't hurt even when the column values are a mix of lists and strings. Nested lists, however, might need to be flattened first.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
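A quick sanity check on toy data (the column name COL_LIST is a stand-in):
import pandas as pd

df = pd.DataFrame({'COL_LIST': [[1, 2], [2, 3]]})
print(df.COL_LIST.explode().nunique())  # 3 unique scalars: 1, 2, 3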
Alternate Approach
Alternatively, if I'd rather not explode the items:
# If col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# A lambda helps if the col values are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _l])).unique()
Bonus
df.COL.apply(json.dumps) might handle all the cases.
OP's solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
I have come across this problem with .nunique() when converting results from a REST API from dict (or list) to a pandas dataframe. The problem is that one of the columns is stored as a list or dict (a common situation with nested JSON results). Here is some sample code to find the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} \n")
print(f"Null values per column: \n{df.isna().sum()} \n")
# check which columns error when counting number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except TypeError:
        ls_cols_error_nunique.append(each_col)
print(f"Unique values per column: \n{df[ls_cols_nunique].nunique()} \n")
print(f"Columns error nunique: \n{ls_cols_error_nunique} \n")
This code should split your dataframe columns into 2 lists:
Columns on which .nunique() can be calculated
Columns that raise an error when running .nunique()
Then just calculate .nunique() on the columns without errors.
As for converting the columns with errors, there are other resources that address that with .apply(pd.Series).
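For example, a minimal sketch that serializes the offending values so they become hashable (assuming the values are JSON-serializable lists or dicts):
import json

# turn unhashable list/dict values into strings so nunique() can count them
for each_col in ls_cols_error_nunique:
    df[each_col] = df[each_col].apply(json.dumps)
print(f"Unique values per converted column: \n{df[ls_cols_error_nunique].nunique()} \n")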
I have a long series like the following:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
In [151]: series
Out[151]:
0 [(1, 2)]
1 [(3, 5)]
2 []
3 [(3, 5)]
dtype: object
I want to remove all entries with an empty list. For some reason, boolean indexing does not work.
The following tests both give the same error:
series == [[(1,2)]]
series == [(1,2)]
ValueError: Arrays were different lengths: 4 vs 1
This is very strange, because in the simple example below, the same kind of comparison works fine:
In [146]: pd.Series([1,2,3]) == [3]
Out[146]:
0 False
1 False
2 True
dtype: bool
P.S. ideally, I'd like to split the tuples in the series into a DataFrame of two columns also.
You could check to see if the lists are empty using str.len():
series.str.len() == 0
and then use this boolean series to remove the rows containing empty lists.
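For example, a minimal sketch of that filtering step:
series = series[series.str.len() > 0]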
If each of your entries is a list containing a two-tuple (or else empty), you could create a two-column DataFrame by using the str accessor twice (once to select the first element of the list, then to access the elements of the tuple):
pd.DataFrame({'a': series.str[0].str[0], 'b': series.str[0].str[1]})
Missing entries default to NaN with this method.
Using the built-in apply you can filter by the length of the list:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
series = series[series.apply(len) > 0]
Your series is in a bad state -- having a Series of lists of tuples of ints
buries the useful data, the ints, inside too many layers of containers.
However, to form the desired DataFrame, you could use
df = series.apply(lambda x: pd.Series(x[0]) if x else pd.Series()).dropna()
which yields
0 1
0 1.0 2.0
1 3.0 5.0
3 3.0 5.0
A better way would be to avoid building the malformed series altogether and
form df directly from the data:
data = [[(1,2)],[(3,5)],[],[(3,5)]]
data = [pair for row in data for pair in row]
df = pd.DataFrame(data)
I import a CSV as a DataFrame using:
import numpy as np
import pandas as pd
df = pd.read_csv("test.csv")
Then I'm trying to do a simple replace based on IDs:
df.loc[df.ID == 103, ['fname', 'lname']] = 'Michael', 'Johnson'
I get the following error:
AttributeError: 'list' object has no attribute 'loc'
Note, when I do print pd.__version__ I get 0.12.0, so it's not a problem (at least as far as I understand) with having a pre-0.11 version. Any ideas?
To pick up from the comment: "I was doing this:"
df = [df.hc== 2]
What you create there is a "mask": an array of booleans indicating which rows fulfill your condition.
To filter your dataframe on your condition you want to do this:
df = df[df.hc == 2]
A bit more explicit is this:
mask = df.hc == 2
df = df[mask]
If you want to keep the entire dataframe and only want to replace specific values, there are methods such as replace: Python pandas equivalent for replace. Another (performance-wise great) method is creating a separate DataFrame with the from/to values as columns and using pd.merge to combine it into the existing DataFrame. Using your mask with .loc to set values is also possible:
df.loc[mask, 'fname'] = 'Johnson'
Note that the chained form df[mask]['fname'] = 'Johnson' assigns to a copy and silently leaves df unchanged. But for a larger set of replacements you would want to use one of the two other methods, or use "apply" with a lambda function (for value transformations). Last but not least: you can use .fillna('bla') to rapidly fill up NA values.
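A minimal sketch of the merge-based replace mentioned above (the data and column names are hypothetical):
import pandas as pd

df = pd.DataFrame({'ID': [101, 103], 'fname': ['Jane', 'Mike'], 'lname': ['Doe', 'Smith']})
# lookup table holding the from/to values, keyed on ID
lookup = pd.DataFrame({'ID': [103], 'fname': ['Michael'], 'lname': ['Johnson']})
df = df.merge(lookup, on='ID', how='left', suffixes=('', '_new'))
for col in ['fname', 'lname']:
    df[col] = df[col + '_new'].fillna(df[col])  # prefer the new value where present
df = df.drop(columns=['fname_new', 'lname_new'])
print(df)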
The traceback indicates that df is a list and not a DataFrame, as expected in your line of code.
It means that between df = pd.read_csv("test.csv") and df.loc[df.ID == 103, ['fname', 'lname']] = 'Michael', 'Johnson' you have other lines of code that assign a list object to df. Review that piece of code to find your bug.
@Boud's answer is correct. .loc assignment works fine if the right-hand side matches the number of elements being replaced:
In [56]: df = pd.DataFrame(dict(A=[1,2,3], B=[4,5,6], C=[7,8,9]))
In [57]: df
Out[57]:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
In [58]: df.loc[1,['A','B']] = -1,-2
In [59]: df
Out[59]:
A B C
0 1 4 7
1 -1 -2 8
2 3 6 9