How can I add an incomplete row in a Pandas DataFrame? - python

I'm trying to copy a row from one DataFrame to another. The issue is that the origin doesn't have as many columns as the destination, leading to a situation like:
origin = pd.DataFrame([[1, 2],
                       [3, 4]], columns=['A', 'B'])
destination = pd.DataFrame(columns=['A', 'B', 'C'])
copy = origin[0:1].to_dict()
destination.loc[0] = copy
I'm getting ValueError: cannot set a row with mismatched columns.
I tested with two identical DataFrames, and it worked fine. What would be the best way to do what I'm trying? I was thinking of dynamically adding NaNs for the additional destination columns, but that doesn't seem very pythonic.
Please note that I'm trying to avoid append(), as I will perform the task frequently, and I read in the Pandas docs that it would probably cause performance issues.
Thanks for your help!

Insert Series
destination.loc[0] = pd.DataFrame(copy).iloc[0]
destination
Out[672]:
     A    B    C
0  1.0  2.0  NaN
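As a runnable sketch of the same trick with the original frames: assigning a Series (rather than a dict) lets pandas align on column labels, so the missing column is filled with NaN automatically.

```python
import pandas as pd

origin = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
destination = pd.DataFrame(columns=['A', 'B', 'C'])

# Assigning a Series aligns on labels; the missing 'C' column becomes NaN
destination.loc[0] = origin.iloc[0]
```

Since `origin.iloc[0]` is already a Series, there is no need to round-trip through a dict at all.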

Related

Am I able to assign keywords to a dataset header retrieved through Pandas and input the information under the header into eg an equation?

The title does say a lot, but I'm completely new to Python and Pandas and want to know how to do this. I have a dataframe with roughly 21,000 rows and 4 headers. I want to assign a 'keyword' (or use the original header) to each column, and have the relevant data fed into an equation, with the other headers doing the same thing. This data will eventually be exported to ArcGIS for processing into visual markers. I've gotten as far as reading the data into Pandas, and now I'm stuck.
Thank you in advance.
I can't comment as I don't have enough rep. Have you tried df['column name']? This is how you can use a column in equations. So, without knowing what you actually need, if you want to use columns in equations, you can do something like this:
df['new column'] = 3 * df['column name']
Or if you want to multiply 2 columns:
df['c'] = df['a'] * df['b']
Or use the multiply method (check here for the original post):
df[['a', 'b']] = df[['a', 'b']].multiply(df['c'], axis="index")
Without actually seeing a df sample or the intended output, I can't be sure any of these is what you're actually after.
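All three patterns together, on a small toy frame (column names invented for illustration):

```python
import pandas as pd

# Toy frame illustrating the three patterns above
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

df['new column'] = 3 * df['a']   # scalar * column
df['c'] = df['a'] * df['b']      # element-wise product of two columns

# multiply() with axis="index" scales several columns by one column at once
df[['a', 'b']] = df[['a', 'b']].multiply(df['c'], axis="index")
```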

Filter nan values out of rows in pandas

I am working on a calculator to determine what to feed your fish as a fun project to learn python, pandas, and numpy.
My data is organized like this:
As you can see, my fishes are rows, and the different foods are columns.
What I am hoping to do, is have the user (me) input a food, and have the program output to me all those values which are not nan.
The reason why I would prefer to leave them as nan rather than 0, is that I use different numbers in different spots to indicate preference. 1 is natural diet, 2 is ok but not ideal, 3 is live only.
Is there any way to do this using pandas? Everywhere I look online helps me filter rows out of columns, but it is quite difficult to find info on filtering columns out of rows.
Currently, my code looks like this:
import pandas as pd
import numpy as np
df = pd.read_excel(r'C:\Users\Daniel\OneDrive\Documents\AquariumAiMVP.xlsx')
clownfish = df[0:1]
angelfish = df[1:2]
damselfish = df[2:3]
So, as you can see, I haven't really gotten anywhere yet. I tried filtering out the nulls using the following idea:
clownfish_wild_diet = pd.isnull(df.clownfish)
But it results in an error, saying:
AttributeError: 'DataFrame' object has no attribute 'clownfish'
Thanks for the help guys. I'm a total pandas noob so it is much appreciated.
You can use masks in pandas:
food = 'Amphipods'
mask = df[food].notnull()
result_set = df[mask]
df[food].notnull() returns a mask (a Series of boolean values indicating if the condition is met for each row), and you can use that mask to filter the real DF using df[mask].
Usually you can combine these two rows to have a more pythonic code, but that's up to you:
result_set = df[df[food].notnull()]
This returns a new DF with the subset of rows that meet the condition (including all columns from the original DF), so you can use other operations on this new DF (e.g. selecting a subset of columns, dropping other missing values, etc.).
See more about .notnull(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html
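Here is the mask approach end to end, on a toy stand-in for the fish/food table (names and ratings assumed, NaN meaning the fish won't eat that food):

```python
import pandas as pd
import numpy as np

# Toy fish/food table; rows are fishes, columns are foods
df = pd.DataFrame({'fish': ['clownfish', 'angelfish', 'damselfish'],
                   'Amphipods': [1, np.nan, 2],
                   'Brine shrimp': [np.nan, 3, 1]})

food = 'Amphipods'
result_set = df[df[food].notnull()]  # keep only rows rated for this food
```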

Pandas style-dataframe has multi-index columns misaligned

I need to style a Pandas dataframe that has a multi-index column arrangement. As an example of my dataframe, consider:
df = pd.DataFrame([[True for a in range(12)]])
df.columns = pd.MultiIndex.from_tuples([(a,b,c) for a in ['Tim','Sarah'] for b in ['house','car','boat'] for c in ['new','used']])
df
This displays the multi-index columns just as I would expect it to, and is easy to read:
However, when I convert it to a style dataframe:
df.style
The column headers suddenly shift and it becomes confusing to figure out where each level begins and ends:
Can anyone help me undo this, to return it to the more readable left-justified setup? I looked through about 10 other posts on SO but none addressed this issue.
TIA.
UPDATE 12/9:
My primary product is a .to_excel() output, and it displays correctly, I just learned. So while this issue is still open for solutions, I am not in urgent need of a solution, but am in search of one nonetheless. Thanks.

how can a specific cell be accessed in a vaex data frame?

vaex is a library similar to pandas, that provides a dataframe class
I'm looking for a way to access a specific cell by row and column
for example:
import vaex
df = vaex.from_dict({'a': [1,2,3], 'b': [4,5,6]})
df.a[0] # this works in pandas but not in vaex
In this specific case you could do df.a.values[0], but if this was a virtual column, it would lead to the whole column being evaluated. What would be faster to do (say in a case of > 1 billon rows, and a virtual column), is to do:
df['r'] = df.a + df.b
df.evaluate('r', i1=2, i2=3)[0]
This will evaluate the virtual column/expression r, from row 2 to 3 (an array of length 1), and get the first element.
This is rather clunky, and there is an issue open on this: https://github.com/vaexio/vaex/issues/238
Maybe you are surprised that vaex does not have something as 'basic' as this, but vaex is often used for really large datasets, where you don't access individual rows that often, so we don't run into this a lot.
@Maarten Breddels is the author of Vaex, so I would take his word. But it's possible he wrote that answer before Vaex added slicing, which in this case is much less "clunky" than described.
import vaex
df = vaex.example()
df.x[:1].values # Access row 0
df.x[1:3].values # Access rows 1 and 2

pandas dataframe add new column based on calculation on other columns and avoid chained indexing

I have a pandas dataframe and I need to add a new column based on a calculation over specific columns, indicated by a column 'site'. I have found a way to do this by resorting to numpy, but it always gives a warning about chained indexing. I am sure there should be a better solution; please help if you know one.
df_num_bin1['Chip_id_3']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_89_S1]*0x100+df_num_bin1[WB_78_S1],df_num_bin1[WB_89_S2]*0x100+df_num_bin1[WB_78_S2])
df_num_bin1['Chip_id_2']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_67_S1]*0x100+df_num_bin1[WB_56_S1],df_num_bin1[WB_67_S2]*0x100+df_num_bin1[WB_56_S2])
df_num_bin1['Chip_id_1']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_45_S1]*0x100+df_num_bin1[WB_34_S1],df_num_bin1[WB_45_S2]*0x100+df_num_bin1[WB_34_S2])
df_num_bin1['Chip_id_0']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_23_S1]*0x100+df_num_bin1[WB_12_S1],df_num_bin1[WB_23_S2]*0x100+df_num_bin1[WB_12_S2])
df_num_bin1['mac_low']=(df_num_bin1['Chip_id_1'].map(int) % 0x10000) *0x100+df_num_bin1['Chip_id_0'].map(int) // 0x1000000
The code above has two issues:
1: Here the value of column [key_site_num] determines which columns I should extract chip id data from. In this example it is only site 0 or 1, but actually it could be 2 or 3 as well. I would need a general solution.
2: It generates a chained-indexing warning:
C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:35: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Well, I'm not too sure about your first question, but I think this will help you.
import pandas as pd
reader = pd.read_csv(path,engine='python')
reader['new'] = reader['treasury.maturity.rate']+reader['bond.yield']
reader.to_csv('test.csv',index=False)
As you can see, you don't need to get the values before operating on them; just reference the column where they are. To do the same for only a specific row, you could filter the dataframe before creating the new column.
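On the first issue (more than two sites), a hedged sketch: np.select generalizes np.where to any number of conditions. The column names here are invented stand-ins for the WB_* columns in the question.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the site and WB_* columns (names and values assumed)
df = pd.DataFrame({'site': [1, 2, 1],
                   'hi_S1': [1, 2, 3], 'lo_S1': [4, 5, 6],
                   'hi_S2': [7, 8, 9], 'lo_S2': [10, 11, 12]})

# np.select picks, per row, the choice whose condition is True;
# add one (condition, choice) pair per site to generalize beyond two
conditions = [df['site'] == 1, df['site'] == 2]
choices = [df['hi_S1'] * 0x100 + df['lo_S1'],
           df['hi_S2'] * 0x100 + df['lo_S2']]
df['Chip_id'] = np.select(conditions, choices)
```

Assigning through a plain column label like this, on the original dataframe rather than a slice of it, also avoids the SettingWithCopyWarning.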
