Why does vaex change column names that contain a period? - python

When using vaex I came across an unexpected error: NameError: name 'column_2_0' is not defined.
After some investigation I found that the column causing problems is actually called column_2.0 in my data source (an HDF5 file), and that vaex renames it to column_2_0; but when I perform operations using column names, I run into the error. Here is a simple example that reproduces it:
import pandas as pd
import vaex
cols = ['abc_1', 'abc1', 'abc.1']
vals = list(range(0,len(cols)))
df = pd.DataFrame([vals], columns=cols)
dfv = vaex.from_pandas(df)
for col in dfv.column_names:
    dfv = dfv[dfv[col].notna()]
dfv.count()
...
NameError: name 'abc_1_1' is not defined
In this case it appears that vaex tries to rename abc.1 to abc_1, which is already taken, so instead it ends up using abc_1_1.
I know that I can rename the column like dfv.rename('abc_1_1', 'abc_dot_1'), but (a) I'd need to introduce special logic for naming conflicts like in this example, where the name vaex comes up with is already taken, and (b) I'd rather not have to do this manually every time I have a column that contains a period.
I could also forbid periods in all column names in my source data, but that seems like a stretch given that pandas and the other sources the data might come from generally don't have this restriction.
What are some ideas to deal with this problem other than the two I mentioned above?

In Vaex the columns are in fact "Expressions". Expressions let you build a sort of computational graph behind the scenes while you do your regular dataframe operations. However, that requires the column names to be as "clean" as possible.
So column names like '2' or '2.5' are not allowed, since the expression system could interpret them as numbers rather than column names. Likewise, a column name like 'first-name' could be interpreted as df['first'] - df['name'].
To avoid this, vaex smartly renames columns so that they can be used in the expression system. This is actually quite complicated. So in your example above, you've found a case that has not been covered yet (isna/notna).
Btw, you can always access the original names via df.get_column_names(alias=True).
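One way to sidestep the problem entirely is to make the names expression-safe on the pandas side, before calling vaex.from_pandas, so vaex never has to invent an alias behind your back. A minimal sketch, assuming a replace-non-word-characters rule; the sanitize helper and the naive collision counter are my own, not vaex API:
import re
import pandas as pd
import vaex

def sanitize(name):
    # Replace anything that is not a letter, digit, or underscore
    return re.sub(r'\W', '_', name)

cols = ['abc_1', 'abc1', 'abc.1']
df = pd.DataFrame([range(len(cols))], columns=cols)

# De-duplicate sanitized names by appending a counter
seen = {}
new_cols = []
for c in map(sanitize, df.columns):
    count = seen.get(c, 0)
    new_cols.append(c if count == 0 else f'{c}_{count}')
    seen[c] = count + 1
df.columns = new_cols

dfv = vaex.from_pandas(df)  # names are now expression-safe, no hidden aliases
The counter scheme here can still collide with a pre-existing name like abc_1_1, so a real version would loop until the name is free; the point is just that the renaming is now explicit and under your control.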

Related

Converting for loop to numpy calculation for pandas dataframes

So I have a Python script that compares two dataframes and works to find any rows that are not in both dataframes. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that iteration is the problem. However, I haven't had much luck using various numpy/pandas methods such as where and merge.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if not df_new[new_col_name][i] in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs in differently named columns of two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe
test_ids = df2['cola_id'].unique().tolist()
then filter the first dataframe for those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter:
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
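For illustration, here is that one-liner applied to trimmed versions of the sample frames above (extra columns dropped for brevity):
import pandas as pd

# Trimmed stand-ins for the sample frames in the question
df_current = pd.DataFrame({'partno': [123, 456],
                           'description': ['Logo T-Shirt', 'Knitted Shirt']})
df_new = pd.DataFrame({'mfgr_num': [456, 789],
                       'desc': ['Knitted Shirt', 'Logo Vest']})

current_col_name = 'partno'
new_col_name = 'mfgr_num'

# Keep only the rows whose ID is absent from the current sheet
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
# df_out now holds the single row with mfgr_num 789 (Logo Vest)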

Why is the `df.columns` an empty list while I can see the column names if I print out the dataframe? Python Pandas

import pandas as pd
DATA = pd.read_csv(url)
DATA.head()
I have a large dataset that has dozens of columns. After loading it like above into Colab, I can see the name of each column. But running DATA.columns just returns Index([], dtype='object'). What's happening here?
Now I find it impossible to pick out a few columns without column names. One way is to specify names = [...] when I load it, but I'm reluctant to do that since there are too many columns. So I'm looking for a way to index a column by integers; in R, df[:,[1,2,3]] would simply give me the first three columns of a dataframe. Somehow pandas seems to focus on column names and makes integer indexing very inconvenient, though.
So what I'm asking is: (1) What did I do wrong? Can I obtain those column names as well when I load the dataframe? (2) If not, how can I pick out the [0, 1, 10]th columns with a list of integers?
It seems that the problem was in the loading, as DATA.shape returned (10000, 0). I reran the loading code a few times, and all of a sudden things went back to normal. Maybe Colab was taking a nap or something?
You can do that with df.iloc[:,[1,2,3]] (loc selects by label, iloc by position), but I would suggest using the column names, because if the columns ever change order or new columns are inserted, the code can break.
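For the second part of the question, a small sketch of positional selection with iloc; the col* names are made up, and note that pandas positions are zero-based, unlike R's:
import pandas as pd

# Stand-in for the CSV: twelve columns, so that position 10 exists
df = pd.DataFrame([range(12)], columns=[f'col{i}' for i in range(12)])

# Pick the 0th, 1st, and 10th columns by position, as asked
subset = df.iloc[:, [0, 1, 10]]
print(subset.columns.tolist())  # ['col0', 'col1', 'col10']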

Can't access part of Pandas dataframe by multiindex

I'm new to Pandas, so this is a basic question. I created a Dataframe by concatenating two previous Dataframes. I used
todo_pd = pd.concat([rabia_pd, capitan_pd], keys=['Rabia','Capitan'])
thinking that in the future I could separate them easily and save each one to a different location. Right now I'm unable to do this separation using the keys I defined with the concat function.
I've tried simple things like
half_dataframe = todo_pd['Rabia']
but it throws an error saying that there is a problem with the key.
I've also tried other options I've found on SO, like using _get_values('Rabia') or .index._get_level_values('Rabia'), but they all throw different errors, either that a string is not a valid way to access the information or that a positional argument 'level' is required.
The whole Dataframe contains about 22 columns, and I just want to retrieve from the "big dataframe" the part indexed as 'Rabia' and the part indexed as 'Capitan'.
I'm sure it has a simple solution that I'm not getting for my lack of practice with Pandas.
Thanks a lot,
Use DataFrame.xs:
df1 = todo_pd.xs('Rabia')
df2 = todo_pd.xs('Capitan')
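A self-contained round trip, using tiny stand-ins for rabia_pd and capitan_pd:
import pandas as pd

rabia_pd = pd.DataFrame({'a': [1, 2]})
capitan_pd = pd.DataFrame({'a': [3, 4]})

# keys= adds an outer index level naming each source frame
todo_pd = pd.concat([rabia_pd, capitan_pd], keys=['Rabia', 'Capitan'])

# xs slices out everything under one outer-level key
df1 = todo_pd.xs('Rabia')    # the original rabia_pd rows
df2 = todo_pd.xs('Capitan')  # the original capitan_pd rows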

Replicating Excel's VLOOKUP in Python Pandas

Would really appreciate some help with the following problem. I'm intending on using Pandas library to solve this problem, so would appreciate if you could explain how this can be done using Pandas if possible.
I want to take the following Excel file:
[screenshot: Before]
and:
1) convert the 'Before' file into a pandas dataframe
2) look for the text in the 'Site' column; where this text appears within the string in the 'Domain' column, return the value in the 'Owner' column under 'Output'
3) the result should look like the 'After' file. I would like to convert this back into CSV format.
[screenshot: After]
So essentially this is similar to an Excel VLOOKUP exercise, except it's not an exact match we're looking for between the 'Site' and 'Domain' columns.
I have already attempted this in Excel, but I'm looking at over 100,000 rows, compared against over 1,000 sites, which crashes Excel.
I have attempted to store the lookup list in the same file as the list of domains we want to classify with the 'Owner'. If there's a much better way to do this, e.g. storing the lookup list in a separate dataframe altogether, then that's fine.
Thanks in advance for any help, I really appreciate it.
Colin
I think the OP's question differs somewhat from the solutions linked in the comments which either deal with exact lookups (map) or lookups between dataframes. Here there is a single dataframe and a partial match to find.
import pandas as pd
import numpy as np

# Read the first sheet of the workbook and coerce everything to strings
df = pd.ExcelFile('data.xlsx').parse(0)
df = df.astype(str)

# Row-wise substring test: does this row's Site appear in its Domain?
df['Test'] = df.apply(lambda x: x['Site'] in x['Domain'], axis=1)

# Where the test passed, copy Owner into Output; otherwise leave it blank
df['Output'] = np.where(df['Test'] == True, df['Owner'], '')
df
The lambda applies the in substring test row by row (axis=1), returning a boolean in Test. That boolean then acts as the rule for looking up Owner and placing it in Output.
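To see it end to end, here is the same recipe on two made-up rows (the domain, site, and owner values are mine, just mirroring the question's layout), finishing with the CSV export from step 3:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Domain': ['shop.apple.com', 'news.bbc.co.uk'],
    'Site':   ['apple.com', 'cnn.com'],
    'Owner':  ['Apple', 'CNN'],
})
df['Test'] = df.apply(lambda x: x['Site'] in x['Domain'], axis=1)
df['Output'] = np.where(df['Test'], df['Owner'], '')
# Row 0: 'apple.com' appears in 'shop.apple.com', so Output is 'Apple'
# Row 1: 'cnn.com' does not appear in 'news.bbc.co.uk', so Output is ''
df.to_csv('after.csv', index=False)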

What is the difference between these two Python pandas dataframe commands?

Let's say I have an empty pandas dataframe.
import pandas as pd
m = pd.DataFrame(index=range(1,100), columns=range(1,100))
m = m.fillna(0)
What is the difference between the following two commands?
m[2][1]
m[2].ix[1] # This code actually allows you to edit the dataframe
Feel free to provide further reading if it would be helpful for future reference.
The short answer is that you probably shouldn't do either of these (see @EdChum's link for the reason):
m[2][1]
m[2].ix[1]
You should generally use a single ix, iloc, or loc command on the entire dataframe any time you want access by both row and column, not a sequential column access followed by row access as you did here. For example,
m.iloc[1,2]
Note that the 1 and 2 are reversed compared to your example because ix/iloc/loc all use the standard row-then-column order. Your syntax is reversed because you are chaining: you first select a column from the dataframe (which yields a Series) and then select a row from that Series.
In simple cases like yours it often won't matter much, but the usefulness of ix/iloc/loc is that they are designed to let you "edit" the dataframe in complex ways (i.e. set or assign values).
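As an aside, ix has since been deprecated and removed from pandas, so in current versions the choice is between loc and iloc. A small sketch of why the single-indexer form matters for assignment:
import pandas as pd

m = pd.DataFrame(index=range(1, 100), columns=range(1, 100)).fillna(0)

# Chained indexing: m[2] first materializes a Series, and the following
# [1] assignment may land on a copy, so it can silently fail to modify m
m[2][1] = 5

# Single-indexer assignment: row label 1, column label 2, always hits m
m.loc[1, 2] = 7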
There is a really good explanation of ix/iloc/loc here:
pandas iloc vs ix vs loc explanation?
and also in standard pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
