So I have a Python script that compares two dataframes and finds any rows that are not in both. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know the iteration is the problem. However, I haven't had much luck with various pandas/NumPy methods such as merge and where.
Couple of caveats:
- The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
- I want to use only the column names from one of the dataframes.
- df_new represents new information to be checked against what is currently on file (df_current).
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if df_new[new_col_name][i] not in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a little unclear what you want without examples of each dataframe, but if you want to test unique IDs held in differently named columns of two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe:
test_ids = df2['cola_id'].unique().tolist()
Then filter the first dataframe for those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter:
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
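For context, a minimal sketch of that one-liner applied to the example frames above; the key columns follow the question (partno in df_current, mfgr_num in df_new), and the cut-down frames are stand-ins for the real data:

import pandas as pd

# cut-down versions of the example dataframes from the question
df_current = pd.DataFrame({"partno": [123, 456], "description": ["Logo T-Shirt", "Knitted Shirt"]})
df_new = pd.DataFrame({"mfgr_num": [456, 789], "desc": ["Knitted Shirt", "Logo Vest"]})

# the key columns have different names, so keep them in variables
current_col_name = "partno"
new_col_name = "mfgr_num"

# anti-join: keep only the df_new rows whose key is absent from df_current
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)  # only the row with mfgr_num 789 remains

# write the new items out, using only df_new's columns
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)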
Related
I have a dataframe with 160,000 rows and I need to know whether its values exist in a column of a different dataframe that has over 7 million rows, using Vaex.
I have tried doing this in pandas but it takes way too long to run.
Once I run this code, I would like a list or a column that says either "True" or "False" for whether each value exists.
There are a few tricks you can do.
Some ideas:
You can try an inner join and then get the list of unique values that appear in both dataframes. Then you can use the isin method on the smaller dataframe with that list to get your answer.
Dunno if this will work out of the box, but it would be something like:
df_join = df_small.join(df_big, on='key', allow_duplicates=True)
common_samples = df_join['key'].tolist()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)
# If it is something you are going to reuse a lot, it may be worth doing
df_small = df_small.materialize('is_in_df_big')  # to put it in memory, otherwise it will be lazily recomputed each time you need it
Similar idea: instead of doing a join, do something like:
unique_samples = df_small.key.unique()
common_samples = df_big[df_big.key.isin(unique_samples)].key.unique()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)
I don't know which one would be faster, but I hope this at least leads to some inspiration, if not the full solution.
I have a dataframe (allPop) and a geodataframe (allTracts). I'm merging them on the column GEOID, which they both share:
newTracts = allTracts.merge(allPop, on='GEOID')
My problem is that I'm losing data on this merge, which conceptually shouldn't be happening. Each of the records in allPop should match one of the records from allTracts, but newTracts has a couple hundred fewer records than allPop. I'd like to be able to look at the records not being included in the merge to try to diagnose the problem. Is there a way to do this?
Or else, can I find the difference between allPop and allTracts based on their 'GEOID' columns? I've seen how to do this when both dataframes have all of the same column names/types, but can I do it based on only one column? I'm not sure what the output for this would look like, but lists of the GEOIDs that aren't being merged from both dataframes would be good, or else the dataframes themselves without the records that were merged. Thanks!
You can use the isin method in Pandas.
badPop = allPop[~allPop['GEOID'].isin(allTracts['GEOID'])]
You can also use the indicator option of the merge method along with how='outer' to find the offending rows.
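A minimal sketch of that second approach, using the frame and column names from the question (pandas names the indicator column '_merge' by default):

# an outer merge with indicator=True records where each row came from
merged = allTracts.merge(allPop, on='GEOID', how='outer', indicator=True)

# rows that did not match on both sides are the ones being lost
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['GEOID', '_merge']])  # 'left_only' = only in allTracts, 'right_only' = only in allPop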
import pandas as pd
DATA = pd.read_csv(url)
DATA.head()
I have a large dataset that has dozens of columns. After loading it into Colab as above, I can see the name of each column. But running DATA.columns just returns Index([], dtype='object'). What's happening here?
Now I find it impossible to pick out a few columns without column names. One way is to specify names=[...] when I load it, but I'm reluctant to do that since there are too many columns. So I'm looking for a way to index columns by integers, like in R, where df[:,[1,2,3]] would simply give me the first three columns of a dataframe. Somehow pandas seems to focus on column names and makes integer indexing very inconvenient, though.
So what I'm asking is (1) What did I do wrong? Can I obtain those column names as well when I load the dataframe? (2) If not, how can I pick out the [0, 1, 10]th column by a list of integers?
It seems that the problem is in the loading, as DATA.shape returns (10000, 0). I reran the loading code a few times and, all of a sudden, things went back to normal. Maybe Colab was taking a nap or something?
You can do that with df.iloc[:, [1, 2, 3]], but I would suggest using the names, because if the columns ever change order or new columns are inserted, the code can break.
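For illustration, a quick sketch of the positional selection once the CSV loads correctly (url is the question's variable; the positions [0, 1, 10] are the ones the question asks about):

import pandas as pd

DATA = pd.read_csv(url)           # url as defined in the question
print(DATA.shape, DATA.columns)   # sanity check: shape should not be (10000, 0)

# pick the 0th, 1st and 10th columns by integer position
subset = DATA.iloc[:, [0, 1, 10]]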
As a very very new beginner with Python & Pandas, I am looking for your support regarding an issue.
I need to iterate over columns, find the maximum value across the relevant columns for each row of a dataframe, and write it to a new column for each row. The number of columns is hard to manage, almost 200, so I do not want to write out each required column id manually. Most importantly, I need to start from a given column id and continue in increments of two column ids up to a given last column id.
I would appreciate sample code; see the attachment too.
Try:
df['x']=df.max(axis=1)
Replace x with the name for your desired output column.
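If the maximum should only be taken from a given starting column, in steps of two, up to a given last column, a variant along these lines might work on the question's df (the positions start and stop and the output name 'row_max' are assumptions, not names from the question):

start, stop = 3, 199  # hypothetical integer positions of the first and last columns to use

# slice every second column by position, then take the row-wise maximum
df['row_max'] = df.iloc[:, start:stop + 1:2].max(axis=1)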
I have a folder that contains ~90 CSV files. Each relevant file is named xxxxx-2012 and has the same column names.
I would like to create a single DataFrame with a specific column power(MW) from each file, i.e. 90 columns in total, naming the column in the resulting DataFrame by the file name.
My objective with problems like this is to get to a simple datastructure as quickly as possible. In this case, that could be a dictionary of filenames to DataFrames.
frames = {filename: pd.read_csv(filename) for filename in os.listdir()}
You may have to filter out bad filenames, e.g. by extension, or you may be better off using glob... either way this breaks the problem up, so it shouldn't be too bad.
Then the question becomes much easier*:
- How do I get one column from a DataFrame? df[colname].
- How do I concat a list of columns to a DataFrame?
*Assuming you know your way around Python data structures, e.g. list comprehensions.
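Putting those pieces together, a rough sketch under the question's assumptions (files ending in -2012.csv, each containing a power(MW) column):

import os
import pandas as pd

# read each matching CSV into a dict keyed by filename
frames = {f: pd.read_csv(f) for f in os.listdir() if f.endswith('-2012.csv')}

# pull the power(MW) column out of each file and line them up side by side,
# naming each resulting column after its file
power = pd.concat({name: df['power(MW)'] for name, df in frames.items()}, axis=1)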
Another option is to just concat the entire dict:
pd.concat(frames)
(which gives you a MultiIndex with all the information.)
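From that MultiIndexed result, one way back to the one-column-per-file layout is to unstack the file level, again assuming the power(MW) column name:

combined = pd.concat(frames)                    # index becomes (filename, original row index)
power = combined['power(MW)'].unstack(level=0)  # filenames become the columns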