I'm new to Pandas, so this is a basic question. I created a DataFrame by concatenating two existing DataFrames. I used
todo_pd = pd.concat([rabia_pd, capitan_pd], keys=['Rabia','Capitan'])
thinking that in the future I could separate them easily and save each one to a different location. Right now I'm unable to do this separation using the keys I defined in the concat call.
I've tried simple things like
half_dataframe = todo_pd['Rabia']
but it throws an error saying there is a problem with the key.
I've also tried other options I've found on SO, like using _get_values('Rabia') or .index._get_level_values('Rabia'), but they all throw different errors, saying either that a string is not recognized as a way to access the information or that a positional argument 'level' is required.
The whole DataFrame has about 22 columns, and I just want to retrieve from the "big dataframe" the part indexed as 'Rabia' and the part indexed as 'Capitan'.
I'm sure it has a simple solution that I'm missing due to my lack of practice with Pandas.
Thanks a lot,
Use DataFrame.xs:
df1 = todo_pd.xs('Rabia')
df2 = todo_pd.xs('Capitan')
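For completeness, here is a minimal, self-contained sketch of the round trip; the two small frames are made-up stand-ins for rabia_pd and capitan_pd:

import pandas as pd

# Stand-ins for the two original frames
rabia_pd = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
capitan_pd = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

# The keys become the outer level of a MultiIndex on the result
todo_pd = pd.concat([rabia_pd, capitan_pd], keys=['Rabia', 'Capitan'])

# xs selects everything under one outer-level key and drops that level
todo_pd.xs('Rabia').to_csv('rabia.csv')      # same rows as rabia_pd
todo_pd.xs('Capitan').to_csv('capitan.csv')  # same rows as capitan_pd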
I'm currently trying to merge two dataframes based on a "key" that is a combination of alphanumeric and special characters. I'm getting "ValueError: can not merge DataFrame with instance of type <class 'str'>".
I'm aware this topic is covered in the thread below; however, the issue seemed very simple in that case, where one of the datasets was not a DataFrame.
Python pandas: "can not merge DataFrame with instance of type <class 'str'>"
I have, on the other hand, checked that both datasets I want to merge are of type DataFrame. So could it be that the "key" I'm using to combine the two dataframes has special characters in it? For example, one such key could look like "1078-ORD-XHKG-HKDOct-23ATM". I do have matching keys in both dataframes for the merge to proceed on, yet I get a ValueError.
The code used to merge is simply:
df_new = pd.merge('df_1', 'df_2', on = 'key', how = 'left')
I have tried using the other ways of merging as suggested in other threads as well. Still doesn't help. Can someone please help?
Can you try using this?
pd.merge(df_1, df_2, how='left', left_on=['key df_1'], right_on=['key df_2'])
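Note that the error message itself gives away the cause: the original call quotes 'df_1' and 'df_2', so pandas receives two strings instead of two DataFrames, which is exactly what raises "can not merge DataFrame with instance of type <class 'str'>". Special characters in the key values are harmless; merge compares them as plain strings. A minimal sketch with made-up data:

import pandas as pd

df_1 = pd.DataFrame({'key': ['1078-ORD-XHKG-HKDOct-23ATM'], 'x': [1]})
df_2 = pd.DataFrame({'key': ['1078-ORD-XHKG-HKDOct-23ATM'], 'y': [2]})

# pd.merge('df_1', 'df_2', on='key')   # strings, not frames: raises the ValueError
df_new = pd.merge(df_1, df_2, on='key', how='left')  # pass the objects, unquoted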
When using vaex I came across an unexpected error NameError: name 'column_2_0' is not defined.
After some investigation I found that in my data source (HDF5 file) the column name causing problems is actually called column_2.0 and that vaex renames it to column_2_0 but when performing operations using column names I run into the error. Here is a simple example that reproduces this error:
import pandas as pd
import vaex
cols = ['abc_1', 'abc1', 'abc.1']
vals = list(range(0,len(cols)))
df = pd.DataFrame([vals], columns=cols)
dfv = vaex.from_pandas(df)
for col in dfv.column_names:
    dfv = dfv[dfv[col].notna()]
dfv.count()
...
NameError: name 'abc_1_1' is not defined
In this case it appears that vaex tries to rename abc.1 to abc_1 which is already taken so instead it ends up using abc_1_1.
I know that I can rename the column like dfv.rename('abc_1_1', 'abc_dot_1'), but (a) I'd need to introduce special logic for naming conflicts like in this example where the column name that vaex comes up with is already taken and (b) I'd rather not have to do this manually each time I have a column that contains a period.
I could also require that column names in my source data never use a period, but this seems like a stretch, given that pandas and other sources the data might come from generally don't have that restriction.
What are some ideas to deal with this problem other than the two I mentioned above?
In Vaex the columns are in fact "Expressions". Expressions allow you to build a sort of computational graph behind the scenes as you do your regular dataframe operations. However, that requires the column names to be as "clean" as possible.
So column names like '2' or '2.5' are not allowed, since the expression system could interpret them as numbers rather than column names. Likewise, a column name like 'first-name' could be interpreted by the expression system as df['first'] - df['name'].
To avoid this, vaex will smartly rename columns so that they can be used in the expression system. This is actually quite tricky to get right, and in your example above you've found a case that has not been covered yet (isna/notna).
Btw, you can always access the original names via df.get_column_names(alias=True).
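As a user-side stopgap (this just automates the questioner's second idea; it is not a vaex feature), you can sanitize the names in pandas before converting. The '_dot_' suffix here is an arbitrary choice that avoids clashing with the existing abc_1:

import pandas as pd
import vaex

df = pd.DataFrame([[0, 1, 2]], columns=['abc_1', 'abc1', 'abc.1'])

# Replace periods up front so vaex never has to invent an alias
df.columns = [c.replace('.', '_dot_') for c in df.columns]

dfv = vaex.from_pandas(df)
for col in dfv.column_names:
    dfv = dfv[dfv[col].notna()]  # no NameError now
dfv.count()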
import pandas as pd
DATA = pd.read_csv(url)
DATA.head()
I have a large dataset that has dozens of columns. After loading it like above into Colab, I can see the name of each column. But running DATA.columns just returns Index([], dtype='object'). What's happening here?
Now I find it impossible to pick out a few columns without column names. One way is to specify names=[...] when I load it, but I'm reluctant to do that since there are too many columns. So I'm looking for a way to index columns by integers, like in R, where df[, c(1, 2, 3)] would simply give me the first three columns of a data frame. Pandas seems to focus on column names, though, and makes integer indexing rather inconvenient.
So what I'm asking is: (1) What did I do wrong? Can I obtain those column names as well when I load the dataframe? (2) If not, how can I pick out the [0, 1, 10]th columns with a list of integers?
It seems that the problem was in the loading, as DATA.shape returned (10000, 0). I reran the loading code a few times and, all of a sudden, things went back to normal. Maybe Colab was taking a nap or something?
You can do that with df.iloc[:, [1, 2, 3]] (iloc selects by position, whereas loc selects by label), but I would suggest using the names, because if the columns ever change order or new columns are inserted, positional code can silently break.
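A minimal sketch of positional selection, on a made-up frame:

import pandas as pd

DATA = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})

DATA.iloc[:, [0, 1, 2]]   # first three columns, by position
DATA.iloc[:, [0, 1, 3]]   # any positions you like, e.g. [0, 1, 10] on a wider frame
DATA[['a', 'b']]          # name-based selection, more robust to reordering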
I'm trying to organise data using a pandas dataframe.
Given the structure of the data, it seems logical to use a composite index: 'league_id' and 'fixture_id'. I believe I have implemented this according to the examples in the docs; however, I am unable to access the data using the index.
My code can be found here;
https://repl.it/repls/OldCorruptRadius
I am very new to Pandas and programming in general, so any advice would be much appreciated! Thanks!
For multi-indexing, you would need to use the pandas MultiIndex API, which brings its own learning curve; thus, I would not recommend it for beginners. Link: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
The way that I use multi-indexing is only to display a final product to others (i.e. making it easy/pretty to view). Before any multi-indexing, filter fixture_id and league_id as ordinary columns first:
df = pd.DataFrame(fixture, columns=features)
df[(df['fixture_id'] == 592) & (df['league_id'] == 524)]
This way, you are still effectively targeting the same rows you would have reached by multi-indexing the two columns.
If you have to use multi-indexing, try transposing the DataFrame with df.T. This turns the index into columns and vice versa. For example, you can do something like this:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
df.T[524][592].loc['event_date']  # gets you the event_date row
df.T[524][592].loc['event_date'].iloc[0]  # gets you the first instance of event_date
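If you do keep the MultiIndex, the more direct route (no transpose needed) is to pass the key pair to .loc as a tuple; a sketch with made-up fixture data:

import pandas as pd

df = pd.DataFrame({
    'league_id': [524, 524],
    'fixture_id': [592, 593],
    'event_date': ['2019-08-10', '2019-08-17'],
}).set_index(['league_id', 'fixture_id'])

df.loc[(524, 592)]                # every column for that league/fixture pair
df.loc[(524, 592), 'event_date']  # a single cell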
I'm tearing my hair out a bit with this one. I've imported two CSVs into pandas dataframes; both have a column called SiteReference, and I want to use pd.merge to join the dataframes using SiteReference as a key.
The initial merge failed, as pd.read_csv interpreted the SiteReference values differently: in one instance 380500145.0, in the other 380500145, both stored as objects. I ran a regex to clean the columns and then pd.to_numeric; this resulted in one value of 380500145.0 and another of 3.805001e+10. They should both be 380500145. I then attempted:
df['SiteReference'] = df['SiteReference'].astype(int).astype('str')
But got back;
ValueError: cannot convert float NaN to integer
How can I control how pandas deals with these, preferably on import?
Perhaps the best solution is to prevent pd.read_csv from inferring the type of this field:
df=pd.read_csv('data.csv',sep=',',dtype={'SiteReference':str})
Following the discussion in the comments, if you want to format floats as integer strings, you can use this:
df['SiteReference'] = df['SiteReference'].map('{:,.0f}'.format)
This should handle null values gracefully.
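To make the import-time fix concrete, here is a minimal sketch with inline data standing in for the real CSVs: read the column as text, then strip the stray '.0' suffix so the keys in both files agree:

import pandas as pd
from io import StringIO

# One value with a trailing .0, one without, one missing
csv = StringIO('SiteReference,Value\n380500145.0,1\n380500145,2\n,3\n')

df = pd.read_csv(csv, dtype={'SiteReference': str})

# Normalise the text: .str methods skip NaN, so no ValueError here
df['SiteReference'] = df['SiteReference'].str.replace(r'\.0$', '', regex=True)
print(df['SiteReference'].tolist())  # ['380500145', '380500145', nan]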