I'm not familiar with Python, but I'd like to use pandas to extract data from the format below.
Is there a way to get the right-most value from each row? The data has over 60,000 rows, and the last populated level varies from row to row.
To access the column in pandas you'll have to give it a name as well ('some_name' for this example). Then it should be as easy as
import pandas as pd
df = pd.read_excel('path/to/your/file')
target = df['some_name']
See pandas.read_excel for further details.
IIUC, you want the last value in each row that is not NaN. I am assuming that after reading your data using pd.read_csv() your data looks something like this -
# CREATING DUMMY DATA
import numpy as np
import pandas as pd
a = [['AA000',np.nan,np.nan,np.nan],
     ['AA006','AA001',np.nan,np.nan],
     ['AA008','AA002',np.nan,np.nan],
     ['AA002','AA003','AA003',np.nan],
     ['AA002','AA006','AA004',np.nan]]
df = pd.DataFrame(a, columns=['Level1','Level2','Level3','Level4'])
df
Note: I modified an old answer of mine for this solution, so if you are interested in knowing what is happening here, do check it out. In a nutshell, I flip the data horizontally so that the values you want become the FIRST non-null values in each row instead of the last. That way argmax(1) returns the position of the first notna value. idxmax(1) returns the column label directly instead of the integer position.
Pandas method:
You can use pandas to solve this as -
result = df.lookup(range(df.shape[0]), df.iloc[:, ::-1].notna().idxmax(1))
result
array(['AA000', 'AA001', 'AA002', 'AA003', 'AA004'], dtype=object)
Here is a visual explanation for the column index
df.iloc[:, ::-1].notna().idxmax(1)
|______________| |_____| |_______|
| | |
horizontal-flipped bool column idx of first True value
Numpy method:
You can use NumPy to solve this as follows -
import numpy as np
col_idx = df.shape[1] - np.fliplr(df.notna().values).argmax(1) - 1
row_idx = np.arange(df.shape[0])
result = df.values[row_idx, col_idx]
result
array(['AA000', 'AA001', 'AA002', 'AA003', 'AA004'], dtype=object)
Visual explanation for what is happening:
df.shape[1] - np.fliplr(df.notna().values).argmax(1) - 1
|_________| |___________________________| |___________|
| | |
# of columns horizontal-flipped matrix idx of first notna
Finally, set it to a column by simply assigning it to a new column name in df -
df['last'] = result
print(df)
Level1 Level2 Level3 Level4 last
0 AA000 NaN NaN NaN AA000
1 AA006 AA001 NaN NaN AA001
2 AA008 AA002 NaN NaN AA002
3 AA002 AA003 AA003 NaN AA003
4 AA002 AA006 AA004 NaN AA004
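As a side note, DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the pandas one-liner above may not run on a recent install. A minimal alternative sketch, assuming the same dummy data: forward-fill along each row and take the last column. This swaps in a different technique (ffill) and is not the method from the answer above.

```python
import numpy as np
import pandas as pd

# Same dummy data as in the answer above.
df = pd.DataFrame(
    [['AA000', np.nan, np.nan, np.nan],
     ['AA006', 'AA001', np.nan, np.nan],
     ['AA008', 'AA002', np.nan, np.nan],
     ['AA002', 'AA003', 'AA003', np.nan],
     ['AA002', 'AA006', 'AA004', np.nan]],
    columns=['Level1', 'Level2', 'Level3', 'Level4'])

# Forward-fill along each row; the last column then holds the
# right-most non-NaN value of that row.
last = df.ffill(axis=1).iloc[:, -1]
print(last.tolist())
# ['AA000', 'AA001', 'AA002', 'AA003', 'AA004']
```

This makes a filled copy of the whole frame, so for very wide data the NumPy indexing approach is probably lighter on memory.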
I have the following data frame coming from an API source. I'm trying to wrangle the data without massively changing my original dataframe (essentially, I don't want to do a cartesian product).
data = ["[['Key','Metric','Value'],['foo','bar','4'],['foo2','bar2','55.21']]",
"[['Key','Metric','Value'],['foo','bar','5']]",
"[['Key','Metric','Value'],['foo','bar','6'],['foo1','bar1',''],['foo2','bar2','57.75']]"]
df = pd.DataFrame({'id' : [0,1,2],'arr' : data})
print(df)
id arr
0 0 [['Key','Metric','Value'],['foo','bar','4'],['...
1 1 [['Key','Metric','Value'],['foo','bar','5']]
2 2 [['Key','Metric','Value'],['foo','bar','6'],['...
The first nested list (Key, Metric, Value) gives the order of the fields in the lists that follow. What I'm trying to do is arrange the data in dictionary fashion, {key : value}, where the key is the Key and Metric fields joined and the value is the last element (index -1) of each nested list.
The source data is coming via Excel and the MS Graph API. I don't envisage that it will change, but it may, so I'm trying to come up with a dynamic solution.
My target dataframe is:
target_df = pd.DataFrame({'id' : [0,1,2],
'foo_bar' : [4,5,6],
'foo1_bar1' : [np.nan, np.nan,''],
'foo2_bar2' : [55.21, np.nan, 57.75]})
print(target_df)
id foo_bar foo1_bar1 foo2_bar2
0 0 4 NaN 55.21
1 1 5 NaN NaN
2 2 6 57.75
My own attempts have used literal_eval from the ast library to get the first list, which will always be the Key, Metric and Value header. In future there may be a Key, Metric, Metric2, Value layout, hence my desire to keep things dynamic.
There will always be a single Key and a single Value field.
Own attempt:
from ast import literal_eval
literal_eval(df['arr'][0])[0]
#['Key', 'Metric', 'Value']
With this I replaced the list characters, split on commas, and converted the result to a dataframe:
df['arr'].str.replace(r'\[|\]', '', regex=True).str.split(',', expand=True)
However, after this I haven't made much clear headway, and I'm wondering if I'm going about this the wrong way.
Try:
df2=df["arr"].map(eval).apply(lambda x: pd.Series({f"{el[0]}_{el[1]}": el[2] for el in x[1:]}))
df2["id"]=df["id"]
Output:
foo_bar foo2_bar2 foo1_bar1 id
0 4 55.21 NaN 0
1 5 NaN NaN 1
2 6 57.75 2
IIUC, you can loop over each row and use literal_eval, create dataframes, set_index the first two columns and transpose. Then concat, rename the columns, and create the id column:
from ast import literal_eval
df_target = pd.concat([pd.DataFrame.from_records(literal_eval(x)).drop(0).set_index([0,1]).T
for x in df.arr.to_numpy()],
ignore_index=True,
keys=df.id) #to keep the ids
# rename the columns as wanted
df_target.columns = ['{}_{}'.format(*col) for col in df_target.columns]
# add the ids as a column
df_target = df_target.reset_index().rename(columns={'index':'id'})
print (df_target)
id foo_bar foo1_bar1 foo2_bar2
0 0 4 NaN 55.21
1 1 5 NaN NaN
2 2 6 57.75
I'm still not entirely sure I understand every aspect of the question, but here's what I have so far.
import ast
import pandas as pd
data = ["[['Key','Metric','Value'],['foo','bar','4'],['foo2','bar2','55.21']]",
"[['Key','Metric','Value'],['foo','bar','5']]",
"[['Key','Metric','Value'],['foo','bar','6'],['foo1','bar1',''],['foo2','bar2','57.75']]"]
nested_lists = [ast.literal_eval(elem)[1:] for elem in data]
row_dicts = [{'_'.join([key, metric]): value for key, metric, value in curr_list} for curr_list in nested_lists]
df = pd.DataFrame(data=row_dicts)
print(df)
Output:
foo_bar foo2_bar2 foo1_bar1
0 4 55.21 NaN
1 5 NaN NaN
2 6 57.75
nested_lists and row_dicts are list comprehensions because that makes debugging easier, but you can of course turn them into generator expressions.
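The question also asks for the id column to be kept and for the layout to stay dynamic if extra fields such as Metric2 appear. A rough sketch under those assumptions: join every field except the last into the column name and treat the last field as the value. The helper name row_to_dict is made up for illustration.

```python
from ast import literal_eval

import pandas as pd

data = ["[['Key','Metric','Value'],['foo','bar','4'],['foo2','bar2','55.21']]",
        "[['Key','Metric','Value'],['foo','bar','5']]",
        "[['Key','Metric','Value'],['foo','bar','6'],['foo1','bar1',''],['foo2','bar2','57.75']]"]
df = pd.DataFrame({'id': [0, 1, 2], 'arr': data})


def row_to_dict(raw):
    """Hypothetical helper: parse one 'arr' string into {joined_key: value}."""
    _header, *rows = literal_eval(raw)
    # Everything but the last field forms the key, so an extra
    # Metric2 field would simply lengthen the joined name.
    return {'_'.join(r[:-1]): r[-1] for r in rows}


wide = pd.DataFrame([row_to_dict(raw) for raw in df['arr']], index=df.index)
result = pd.concat([df[['id']], wide], axis=1)
print(result)
```

The values stay strings here (including the empty string), so a numeric conversion such as pd.to_numeric with errors='coerce' would be a separate step if needed.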
I have a DataFrame with a multiindex in the columns and would like to use dictionaries to append new rows.
Let's say that each row in the DataFrame is a city. The columns contains "distance" and "vehicle". And each cell would be the percentage of the population that chooses this vehicle for this distance.
I'm constructing an index like this:
index_tuples = []
for distance in ["near", "far"]:
    for vehicle in ["bike", "car"]:
        index_tuples.append([distance, vehicle])
index = pd.MultiIndex.from_tuples(index_tuples, names=["distance", "vehicle"])
Then I'm creating a dataframe:
dataframe = pd.DataFrame(index=["city"], columns = index)
The structure of the dataframe looks good, although pandas has added NaNs as default values.
Now I would like to set up a dictionary for the new city and add it:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe["my_home_city"] = my_home_city
But this fails:
ValueError: Length of values does not match length of index
Here is the complete error message (pastebin)
UPDATE:
Thank you for all the good answers. I'm afraid I've oversimplified the problem in my example. Actually my index is nested with 3 levels (and it could become more).
So I've accepted the universal answer of converting my dictionary into a list of tuples. This might not be as clean as the other approaches but works for any multiindex setup.
A MultiIndex is essentially a list of tuples, so we just need to reshape your dict to use tuple keys; then we can assign the values directly.
d = {(x,y):my_home_city[x][y] for x in my_home_city for y in my_home_city[x]}
df.loc['my_home_city',:]=d
df
Out[994]:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
More Info
d
Out[995]:
{('far', 'bike'): 0,
('far', 'car'): 1,
('near', 'bike'): 1,
('near', 'car'): 0}
df.columns.values
Out[996]: array([('near', 'bike'), ('near', 'car'), ('far', 'bike'), ('far', 'car')], dtype=object)
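Since the question's update mentions three or more levels, here is a rough sketch of how the tuple-key idea might generalise. The flatten helper and the third "season" level are made up for illustration; they are not from the original posts.

```python
import pandas as pd


def flatten(nested, prefix=()):
    """Hypothetical helper: {'a': {'b': {'c': 1}}} -> {('a', 'b', 'c'): 1}."""
    flat = {}
    for key, value in nested.items():
        if isinstance(value, dict):
            flat.update(flatten(value, prefix + (key,)))
        else:
            flat[prefix + (key,)] = value
    return flat


# Assumed three-level layout: distance / vehicle / season.
columns = pd.MultiIndex.from_product(
    [["near", "far"], ["bike", "car"], ["summer", "winter"]],
    names=["distance", "vehicle", "season"])
df = pd.DataFrame(index=["city"], columns=columns)

my_home_city = {
    "near": {"bike": {"summer": 0.7, "winter": 0.4},
             "car": {"summer": 0.3, "winter": 0.6}},
    "far": {"bike": {"summer": 0.1, "winter": 0.0},
            "car": {"summer": 0.9, "winter": 1.0}},
}

# Tuple keys become a MultiIndex; reindex puts them in column order.
row = pd.Series(flatten(my_home_city)).reindex(df.columns)
df.loc["my_home_city"] = row.to_numpy()
print(df)
```

The recursion stops at the first non-dict value, so it should cope with any nesting depth as long as every leaf sits at the same level as the column index.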
You can append to your dataframe like this:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
Output:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
The trick is to create the dataframe row with from_dict then unstack to get structure of your original dataframe with multiindex columns then rename to get index and append.
Or if you don't want to create the empty dataframe first you can use this method to create the dataframe with the new data.
pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city').to_frame().T
Output:
far near
bike car bike car
my_home_city 0 1 1 0
Explained:
pd.DataFrame.from_dict(my_home_city)
far near
bike 0 1
car 1 0
Now, let's unstack to create a MultiIndex and get that new dataframe into the structure of the original dataframe.
pd.DataFrame.from_dict(my_home_city).unstack()
far bike 0
car 1
near bike 1
car 0
dtype: int64
We use rename to give that series a name which becomes the index label of that dataframe row when appended to the original dataframe.
far bike 0
car 1
near bike 1
car 0
Name: my_home_city, dtype: int64
Now, if you converted that series to a frame and transposed it, it would look very much like a new row. However, there is no need to do this because pandas does intrinsic data alignment, so appending this series to the dataframe will auto-align and add the new record.
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
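One caveat for readers on recent versions: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the same idea with pd.concat, assuming the dataframe and dictionary from the question; unlike append with a Series, concat needs the row as a one-row frame, so the to_frame().T step is required here.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["near", "far"], ["bike", "car"]], names=["distance", "vehicle"])
dataframe = pd.DataFrame(index=["city"], columns=columns)

my_home_city = {"near": {"bike": 1, "car": 0}, "far": {"bike": 0, "car": 1}}

# Build the one-row frame, then line its columns up with the target.
new_row = (pd.DataFrame.from_dict(my_home_city)
           .unstack()
           .rename("my_home_city")
           .to_frame()
           .T
           .reindex(columns=dataframe.columns))

result = pd.concat([dataframe, new_row])
print(result)
```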
I don't think you even need to initialise an empty dataframe. With your d, I can get your desired output with unstack and a transpose:
pd.DataFrame(d).unstack().to_frame().T
far near
bike car bike car
0 0 1 1 0
Initialize your empty dataframe using MultiIndex.from_product.
distances = ['near', 'far']
vehicles = ['bike', 'car']
df = pd.DataFrame([], columns=pd.MultiIndex.from_product([distances, vehicles]),
index=pd.Index([], name='city'))
Your dictionary results in a square matrix (distance by vehicle), so unstack it (which will result in a Series), then convert it into a dataframe row by calling (to_frame) using the relevant city name and transposing the column into a row.
>>> df.append(pd.DataFrame(my_home_city).unstack().to_frame('my_home_city').T)
far near
bike car bike car
city
my_home_city 0 1 1 0
Just to add to all of the answers, this is just another(maybe not too different) simple example, represented in a more reproducible way :
import itertools as it
from IPython.display import display # this is just for displaying output purpose
import numpy as np
import pandas as pd
col_1, col_2 = ['A', 'B'], ['C', 'D']
arr_size = len(col_2)
col = pd.MultiIndex.from_product([col_1, col_2])
tmp_df = pd.DataFrame(columns=col)
display(tmp_df)
for s in range(3):  # number of rows to add to tmp_df
    tmp_dict = {x : [np.random.random_sample(1)[0] for i in range(arr_size)] for x in range(arr_size)}
    tmp_ser = pd.Series(it.chain.from_iterable([tmp_dict[x] for x in tmp_dict]), index=col)
    # display(tmp_dict, tmp_ser)
    tmp_df = tmp_df.append(tmp_ser[tmp_df.columns], ignore_index=True)
display(tmp_df)
Some things to note about above:
The number of items to add should always match len(col_1)*len(col_2), that is the product of element lengths your multi-index is made from.
list(it.chain.from_iterable([[2, 3], [4, 5]])) simply returns [2, 3, 4, 5].
Try this workaround:
1. Append to a dict.
2. Then convert it to a pandas data frame.
3. As the very last step, select the desired columns to create the multi-index with set_index().
d = dict()
for g in predictor_types:
    for col in predictor_types[g]:
        tot = len(ames) - ames[col].count()
        if tot:
            d.setdefault('type',[]).append(g)
            d.setdefault('predictor',[]).append(col)
            d.setdefault('missing',[]).append(tot)
pd.DataFrame(d).set_index(['type','predictor']).style.bar(color='DodgerBlue')
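The snippet above relies on ames and predictor_types, which are not defined in the post. A self-contained sketch of the same dict-then-set_index pattern, with small made-up stand-ins for both (the column names and values are assumptions), and without the styling call:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `ames` and `predictor_types`.
ames = pd.DataFrame({
    'LotArea': [8450, np.nan, 11250],
    'Street': ['Pave', 'Pave', None],
    'YearBuilt': [2003, 1976, 2001],
})
predictor_types = {'continuous': ['LotArea', 'YearBuilt'],
                   'nominal': ['Street']}

# Collect plain lists first; build the multi-index only at the end.
records = {'type': [], 'predictor': [], 'missing': []}
for group, cols in predictor_types.items():
    for col in cols:
        missing = int(ames[col].isna().sum())
        if missing:
            records['type'].append(group)
            records['predictor'].append(col)
            records['missing'].append(missing)

summary = pd.DataFrame(records).set_index(['type', 'predictor'])
print(summary)
```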
I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.
Initially I tried a for-loop over each value of the dataframe, which took too much time.
Then I used data_new=data.subtract(data), which was meant to subtract the dataframe from itself so that all the non-null values become 0.
But an error occurred because the dataframe has multiple string entries.
You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:
newdf = df.notnull().astype('int')
If you really want to write into your original DataFrame, this will work:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
Use notnull with casting boolean to int by astype:
print ((df.notnull()).astype('int'))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
a b
0 NaN 1.0
1 4.0 NaN
2 NaN 3.0
print (df.notnull())
a b
0 False True
1 True False
2 False True
print ((df.notnull()).astype('int'))
a b
0 0 1
1 1 0
2 0 1
I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to have a source for a column populated via an operation on another.
e.g. if df['col1'] is the existing column
df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else 0)
where col2 is the new column. Should also work if col1 has string entries.
I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.
I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import datetime as dt
# create dataframe with randomly placed NaNs
data = np.ones((100, 100))  # shape must be integers, not floats such as 1e2
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)
trials = np.arange(100)
d1 = dt.datetime.now()
for r in trials:
    new_df = df.notnull().astype(int)
print( (dt.datetime.now()-d1).total_seconds()/trials.size )
# create a dummy copy of df. I use a dummy copy here to prevent biasing the
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()
d1 = dt.datetime.now()
for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull()==False] = 1
print( (dt.datetime.now()-d1).total_seconds()/trials.size )
This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
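For anyone wanting to repeat the comparison, a hedged sketch using the standard-library timeit module instead of manual datetime arithmetic. The function names are made up, and unlike the script above the indexing variant copies the frame on every call, so the absolute numbers will differ and no timings are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = np.ones((100, 100))
# Knock out 10% of the cells at random.
nan_positions = rng.choice(data.size, data.size // 10, replace=False)
data.ravel()[nan_positions] = np.nan
df = pd.DataFrame(data)


def cast_method():
    # Boolean mask cast to 0/1.
    return df.notnull().astype(int)


def indexing_method():
    # Replace by boolean indexing on a copy.
    out = df.copy()
    mask = df.isnull()
    out[mask] = 0
    out[~mask] = 1
    return out


# Both approaches should agree before their speed is compared.
same = bool((cast_method() == indexing_method()).all().all())

t_cast = timeit.timeit(cast_method, number=20)
t_index = timeit.timeit(indexing_method, number=20)
print(same, t_cast, t_index)
```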
There is a method .fillna() on DataFrames which does what you need. For example:
df = df.fillna(0) # Replace all NaN values with zero, returning the modified DataFrame
or
df.fillna(0, inplace=True) # Replace all NaN values with zero, updating the DataFrame directly
Regarding fmarc's answer:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
The code above does not work for me, but the code below does.
df[~df.isnull()] = 1 # not nan
df[df.isnull()] = 0 # nan
This is with pandas 0.25.3.
And if you want to change values only in specific columns, you may need to create a temporary dataframe and assign it back to the columns of the original dataframe:
change_col = ['a', 'b']
tmp = df[change_col].copy()  # copy so the edit does not hit a view of df
tmp[tmp.isnull()]='xxx'
df[change_col]=tmp
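For the 0/1 flags the original question asks about, the same column-subset idea can probably be written without a temporary frame. A small sketch with made-up data (column 'c' is left untouched on purpose):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 4, np.nan],
                   'b': [1, np.nan, 3],
                   'c': ['x', None, 'z']})

change_col = ['a', 'b']
# notna() gives a boolean frame; casting to int yields 1 / 0.
df[change_col] = df[change_col].notna().astype(int)
print(df)
```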
Try this one:
df.notnull().mul(1)
Here is a suggestion for a single column: replace the NaN rows in that column with 0 and all other rows with 1.
The line below fills the NaN values in your column with 0:
df.YourColumnName.fillna(0,inplace=True)
The remaining non-NaN values are then replaced with 1 by the code below (note that values that were already 0 stay 0):
df["YourColumnName"]=df["YourColumnName"].apply(lambda x: 1 if x!=0 else 0)
The same can be applied to the whole dataframe by not specifying a column name.
Use: df.fillna(0)
to fill NaN with 0.
Generally there are two steps: substitute all non-NaN values, then substitute all NaN values.
dataframe.where(~dataframe.notna(), 1) - this line replaces all non-NaN values with 1.
dataframe.fillna(0) - this line replaces all NaNs with 0.
Side note: according to the pandas documentation, .where replaces the values for which the condition is False, and this is the important point. That is why we invert the mask with ~dataframe.notna(): .where() then replaces exactly the non-NaN values.
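A small worked sketch of the two steps chained together, on made-up data that includes a string column (the original question mentions string entries). The final astype(int) is an addition to tidy up the dtype, not part of the answer above.

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame({'a': [np.nan, 4, np.nan],
                          'b': ['x', np.nan, 'y']})

# Step 1: keep cells where the mask is True (the NaNs), set the rest to 1.
# Step 2: fill the remaining NaNs with 0.
flags = dataframe.where(dataframe.isna(), 1).fillna(0).astype(int)
print(flags)
#    a  b
# 0  0  1
# 1  1  0
# 2  0  1
```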
I'm trying to select every entry in a pandas DataFrame D corresponding to a certain userid, filling missing etime values with zeros as follows:
user_entries = D.loc[userid]
user_entries.index = user_entries.etime
user_entries = user_entries.reindex(range(distinct_time_entries_num))
user_entries = user_entries.fillna(0)
The problem is that for some ids there exists exactly one entry, and thus .loc returns a Series object with an unexpected index:
(Pdb) user_entries.index = user_entries.etime
*** TypeError: Index(...) must be called with a collection of some kind, 388 was passed
(Pdb) user_entries
etime 388
requested 1
rejected 0
Name: 351, dtype: int64
(Pdb) user_entries.index
Index([u'etime', u'requested', u'rejected'], dtype='object')
which is painful to handle. I'd seriously prefer a DataFrame object with one row. Is there any way around it? Thanks.
UPD: I have to apologize for the unintelligible formulation; this is my first post here. I'll try again.
So the deal is: there is a dataframe, indexed by userid. Every userid can have up to some number N of corresponding dataframe rows (columns are: 'etime','requested','rejected'), for which 'etime' is basically the key. For some 'userid' all N corresponding entries exist, but for most of them there are missing entries for some 'etime'.
My intention is: for every 'userid', construct an explicit DataFrame object containing all N entries indexed by 'etime', filled with zeros for the missing entries. That's why I'm changing the index to 'etime' and then reindexing the selected row subset with the full 'etime' range.
The problem is: for some 'userid' there is exactly one corresponding 'etime', for which .loc subsetting returns not a dataframe with one row indexed by 'userid' but a series object indexed by the array:
Index([u'etime', u'requested', u'rejected'], dtype='object')
And that's why changing index fails. Checking dimensions and index every time I select some dataframe subset looks pretty ugly. What else can I do about it?
UPD2: here is the script demonstrating the case
import pandas as pd
full_etime_range = range(10)
df = pd.DataFrame(index=[0,0,1],
                  columns=['etime','requested'],
                  data=[[0,1],[1,1],[1,1]])
for i in df.index:
    tmp = df.loc[i]
    tmp.index = tmp['etime']
    tmp = tmp.reindex(full_etime_range,fill_value = 0)
    print(tmp)
So, starting with df being your dataframe, we can do the following safely:
In[215]: df.set_index([df.index, 'etime'], inplace=True)
In[216]: df
Out[216]:
requested
etime
0 0 1
1 1
1 1 1
DF = pd.DataFrame(index=full_etime_range, columns=[])
df0 = DF.copy()
In[225]: df0.join(df.loc[0])
Out[225]:
requested
0 1
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
In[230]: df1 = DF.copy()
In[231]: df1.join(df.loc[1])
Out[231]:
requested
0 NaN
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
which is technically what you want. But behold, we can do this nicer:
listOfDf = [DF.copy().join(df.loc[i]) for i in df.index.get_level_values(0).unique()]
I wanted to do it even one level nicer, but the following did not work - maybe someone can chip in why.
df.groupby(level=0).apply(lambda x: DF.copy().join(x))
Are you just trying to fill nas? Why are you reindexing the dataframe?
Just
user_entries = D.loc[userid]
user_entries = user_entries.fillna(0)
should do the trick. But if you want to fill NAs only in the etime field, you should do this:
user_entries = D.loc[userid]
temp = user_entries["etime"].fillna(0)
user_entries["etime"] = temp
Hope it helps. If not, please clarify what you're trying to achieve.
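On the single-row case raised in the question: one option not shown in the answers above is to pass a list to .loc, which returns a DataFrame even when only one row matches. A minimal sketch, assuming the demo frame from the question's UPD2 script; the per_user dict is just for illustration.

```python
import pandas as pd

full_etime_range = range(10)
df = pd.DataFrame(index=[0, 0, 1],
                  columns=['etime', 'requested'],
                  data=[[0, 1], [1, 1], [1, 1]])

per_user = {}
for userid in df.index.unique():
    # A list selector keeps the result two-dimensional,
    # even for userid 1, which has a single row.
    rows = df.loc[[userid]]
    per_user[userid] = (rows.set_index('etime')
                            .reindex(full_etime_range, fill_value=0))

print(per_user[1])
```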
I have two Series s1 and s2 with the same (non-consecutive) indices. How do I combine s1 and s2 into two columns of a DataFrame and keep one of the indices as a third column?
I think concat is a nice way to do this. If they are present it uses the name attributes of the Series as the columns (otherwise it simply numbers them):
In [1]: s1 = pd.Series([1, 2], index=['A', 'B'], name='s1')
In [2]: s2 = pd.Series([3, 4], index=['A', 'B'], name='s2')
In [3]: pd.concat([s1, s2], axis=1)
Out[3]:
s1 s2
A 1 3
B 2 4
In [4]: pd.concat([s1, s2], axis=1).reset_index()
Out[4]:
index s1 s2
0 A 1 3
1 B 2 4
Note: This extends to more than 2 Series.
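To illustrate the note about more than two Series, a short sketch with three made-up Series sharing a non-consecutive index, as in the question. The rename_axis call is an optional extra to give the index column a name other than the default 'index'.

```python
import pandas as pd

idx = [10, 20, 40]
s1 = pd.Series([1, 2, 3], index=idx, name='s1')
s2 = pd.Series([4, 5, 6], index=idx, name='s2')
s3 = pd.Series([7, 8, 9], index=idx, name='s3')

combined = pd.concat([s1, s2, s3], axis=1).reset_index()
named = pd.concat([s1, s2, s3], axis=1).rename_axis('idx').reset_index()
print(combined)
print(named)
```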
Why don't you just use .to_frame if both have the same indexes?
>= v0.23
a.to_frame().join(b)
< v0.23
a.to_frame().join(b.to_frame())
Pandas will automatically align the passed-in series and create the joint index (the indexes happen to be the same here). reset_index moves the index to a column.
In [2]: s1 = Series(randn(5),index=[1,2,4,5,6])
In [4]: s2 = Series(randn(5),index=[1,2,4,5,6])
In [8]: DataFrame(dict(s1 = s1, s2 = s2)).reset_index()
Out[8]:
index s1 s2
0 1 -0.176143 0.128635
1 2 -1.286470 0.908497
2 4 -0.995881 0.528050
3 5 0.402241 0.458870
4 6 0.380457 0.072251
If I may answer this.
The key to converting series to a data frame is to understand that:
1. At conceptual level, every column in data frame is a series.
2. And, every column name is a key name that maps to a series.
If you keep above two concepts in mind, you can think of many ways to convert series to data frame.
One easy solution will be like this:
Create two series here
import pandas as pd
series_1 = pd.Series(list(range(10)))
series_2 = pd.Series(list(range(20,30)))
Create an empty data frame with just desired column names
df = pd.DataFrame(columns = ['Column_name#1', 'Column_name#2'])
Put series value inside data frame using mapping concept
df['Column_name#1'] = series_1
df['Column_name#2'] = series_2
Check results now
df.head(5)
Example code:
a = pd.Series([1,2,3,4], index=[7,2,8,9])
b = pd.Series([5,6,7,8], index=[7,2,8,9])
data = pd.DataFrame({'a': a,'b':b, 'idx_col':a.index})
Pandas allows you to create a DataFrame from a dict with Series as the values and the column names as the keys. When it finds a Series as a value, it uses the Series index as part of the DataFrame index. This data alignment is one of the main perks of pandas. Consequently, unless you have other needs, the freshly created DataFrame holds duplicated values: in the above example, data['idx_col'] has the same data as data.index.
Not sure I fully understand your question, but is this what you want to do?
pd.DataFrame(data=dict(s1=s1, s2=s2), index=s1.index)
(index=s1.index is not even necessary here)
A simplification of the solution based on join():
df = a.to_frame().join(b)
If you are trying to join Series of equal length but their indexes don't match (which is a common scenario), then concatenating them will generate NAs wherever they don't match.
x = pd.Series({'a':1,'b':2,})
y = pd.Series({'d':4,'e':5})
pd.concat([x,y],axis=1)
#Output (I've added column names for clarity)
Index x y
a 1.0 NaN
b 2.0 NaN
d NaN 4.0
e NaN 5.0
Assuming that you don't care whether the indexes match, the solution is to reset the index of both Series before concatenating them. If drop=False, which is the default, then pandas will save the old index in a column of the new dataframe (the indexes are dropped here for simplicity).
pd.concat([x.reset_index(drop=True),y.reset_index(drop=True)],axis=1)
#Output (column names added):
Index x y
0 1 4
1 2 5
I used pandas to convert my NumPy array or Series to a dataframe, then added the additional column under the key 'prediction'. If you need the dataframe converted back to a list, use values.tolist().
output=pd.DataFrame(X_test)
output['prediction']=y_pred
rows=output.values.tolist()  # named rows to avoid shadowing the built-in list