Iterate Through Folder and Add One Column of Each CSV to Dataframe - python

I have a folder that contains ~90 CSV files. Each relevant file is named xxxxx-2012 and has the same column names.
I would like to create a single DataFrame with a specific column power(MW) from each file, i.e. 90 columns in total, naming the column in the resulting DataFrame by the file name.

My objective with problems like this is to get to a simple data structure as quickly as possible. In this case, that could be a dictionary of filenames to DataFrames.
frames = {filename: pd.read_csv(filename) for filename in os.listdir()}
You may have to filter out bad filenames, e.g. by extension, or you may be better off using glob. Either way, breaking up the problem like this means it shouldn't be too bad.
Then the question becomes much easier*:
How do I get one column from a DataFrame? df[colname].
How do I concat a list of columns into a DataFrame?
*Assuming you know your way around Python data structures, e.g. list comprehensions.
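Putting those pieces together, here is a minimal sketch, assuming the files all end in -2012.csv and the column is literally named power(MW), as in the question:

```python
import glob

import pandas as pd

# Read every 2012 CSV in the current directory into a dict keyed by filename.
frames = {fname: pd.read_csv(fname) for fname in glob.glob("*-2012.csv")}

# Pull the power(MW) column out of each file and concat the columns side by
# side: one column per file, named after the file it came from.
power = pd.concat(
    {fname: df["power(MW)"] for fname, df in frames.items()},
    axis=1,
)
```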
Another option is to just concat the entire dict:
pd.concat(frames)
(which gives you a MultiIndex with all the information.)
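With the MultiIndex version, each file's rows sit under that file's name in the outer index level, so pieces can be pulled back out directly. A small sketch reusing the frames dict from above; "xxxxx-2012.csv" stands in for a real filename:

```python
combined = pd.concat(frames)

# All rows from a single file.
one_file = combined.loc["xxxxx-2012.csv"]

# Or the power(MW) column across every file, still keyed by filename.
all_power = combined["power(MW)"]
```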

Related

Reading Excel files and detect column name in python

I have some Excel files that include some rows (it could be one or more) at the top for description, and below them are the tables with the column names and values. Also, some column names span two rows, which I need to merge, and there are cases where the column name spans three rows.
I would like to go through each file, skipping the first lines, to detect the rows that contain the column names. What would be your suggestions for this?
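One way to approach this is to read the sheet once with no header, find the first row that contains a column name you know must be present, then re-read using that row (or that row plus the next, for two-row headers). A minimal sketch; "report.xlsx" and the marker string "partno" are hypothetical stand-ins:

```python
import pandas as pd

# Read with no header so the description rows come in as ordinary data.
raw = pd.read_excel("report.xlsx", header=None)

# Find the first row that mentions a column name we know must be there.
marker = "partno"
is_header = raw.apply(
    lambda row: row.astype(str).str.contains(marker, case=False).any(), axis=1
)
header_row = is_header.idxmax()  # index label of the first matching row

# Single-row header:
df = pd.read_excel("report.xlsx", header=header_row)

# Two-row header: pass both rows, then flatten the MultiIndex into one name.
df2 = pd.read_excel("report.xlsx", header=[header_row, header_row + 1])
df2.columns = [
    " ".join(str(p) for p in col if str(p) != "nan").strip() for col in df2.columns
]
```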

Converting for loop to numpy calculation for pandas dataframes

So I have a Python script that compares two dataframes and works to find any rows that are not in both. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that iteration is the problem. However, I haven't been having much luck with various numpy/pandas methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I store their names in variables and compare using those.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if df_new[new_col_name][i] not in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create an xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs in differently named columns in two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe
test_ids = df2['cola_id'].unique().tolist()
then filter the first dataframe for those IDs.
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter.
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
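A small end-to-end version of that one-liner, using the sample rows and the column-name variables from the question:

```python
import pandas as pd

current_col_name = "partno"
new_col_name = "mfgr_num"

df_current = pd.DataFrame(
    {"partno": [123, 456], "description": ["Logo T-Shirt", "Knitted Shirt"]}
)
df_new = pd.DataFrame(
    {"mfgr_num": [456, 789], "desc": ["Knitted Shirt", "Logo Vest"]}
)

# Keep only the rows of df_new whose ID is absent from df_current; the result
# keeps df_new's columns, and no Python-level loop is involved.
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)  # one row: 789 / Logo Vest
```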

Applying python dictionary to column of a CSV file

I have a CSV file that includes one column of data that is not user-friendly. I need to translate that data into something that makes sense. A simple find/replace seems bulky, since there are dozens if not hundreds of different possible combinations I want to translate.
For instance: BLK = Black or MNT TP = Mountain Top
There are dozens if not hundreds of translations possible - I have lots of them already in a CSV table. The problem is how to use that dictionary to change the values in another CSV table. It is also important to note that this will (eventually) need to run on its own every few minutes - not just a one time translation.
It would be nice if you could describe in more detail what the data you're working on looks like. I'll make my best guess, though.
Let's say you have a CSV file, you use pandas to read it into a data frame named df, and the "not user friendly" column named col.
To replace all the value in column col, first, you need a dictionary containing all the keys (original texts) and values (new texts):
my_dict = {"BLK": "Black", "MNT TP": "Mountain Top", ...}
Then, map the dictionary to the column:
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
If a key appears in the dictionary, it will be replaced by the new corresponding value in the dictionary, otherwise, it keeps the original value.
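Since the translations already live in a CSV table, the dictionary can be built from that file rather than typed out by hand. A minimal sketch; the filenames and the "code"/"meaning" column names are assumptions:

```python
import pandas as pd

# Build the lookup dict from the translation table, e.g. rows like "BLK,Black".
translations = pd.read_csv("translations.csv")
my_dict = dict(zip(translations["code"], translations["meaning"]))

# Apply it to the data file; unknown codes pass through unchanged.
df = pd.read_csv("data.csv")
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
df.to_csv("data_translated.csv", index=False)
```

Since this needs to run on its own every few minutes, a script like this could then be scheduled externally, e.g. with cron.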

How to get column names after importing pickle file into Pandas

I am new to hands-on Python and programming in general. I have imported a 6 GB pickle file into pandas and been able to display its contents. It doesn't look well ordered, however. My dataframe has a varying number of rows and 842 columns.
My next task is to:
get the names of all 842 columns so I can find columns that have similar features.
create a new column (series) with data from (1) above.
"append" the new column to the original dataframe.
Thus far I have tried column, col, and dataframe.columns to get the column names, but none of them work.
Please see what my program looks like: code and output.
You can get a list of your dataframe's column names using this:
list(your_dataframe.columns)
For adding new columns, check this:
new columns in pandas
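A short sketch of all three steps from the question: list the names, build a new column from a group of similar ones, and append it. The substring "price" used here to pick out "similar" columns is just a hypothetical example:

```python
# 1. Get the names of all 842 columns.
cols = list(your_dataframe.columns)

# 2. Pick out columns with a similar feature (here: names containing "price").
similar = [c for c in cols if "price" in c.lower()]

# 3. Build a new column from them and append it to the original dataframe.
your_dataframe["price_total"] = your_dataframe[similar].sum(axis=1)
```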

How to use usecols elements which are regex rather than strings?

I created a script to go over the needed data, using pandas.
I'm now receiving more files that I need to go over, and sadly these files do not have the same headers.
For example, I have placed 'id_num' in my list of columns to use, but in some of the files it appears as 'num_id'.
Is it possible to still use the usecols list I created, and allow certain elements in it to "connect" with different header strings, for example by using regex?
I assume you're referring to the usecols keyword in pd.read_csv (or some analogous pandas reader)? I'm sure you've gathered that pandas can't do a regex search on a dataframe before it has even read the file, so passing regex strings in a plain usecols list isn't feasible.
However, after you read the csv into a dataframe (let's name it df for the sake of the example), you could very easily filter the columns of interest using regexes.
For example, suppose your new dataframe is loaded into df:
potential_columns = ['num_id', 'id_num']
df_cols = [col for col in df.columns if re.search('|'.join(potential_columns), col)]
List all the column names you want to search for in potential_columns. Then, using join, build one combined regex. Then use a list comprehension to collect all matching columns from df.columns. Once that's done, you can finish the process by calling:
df = df[df_cols]
Dealing with duplicate columns and creating clever keywords to search for is left as an exercise for you.
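Put together, the approach looks like this; "data.csv" is a placeholder. As a side note, pd.read_csv also accepts a callable for usecols, which lets the same regex run against each header name at read time:

```python
import re

import pandas as pd

potential_columns = ["num_id", "id_num"]
pattern = "|".join(potential_columns)

# Post-read filtering, as described above.
df = pd.read_csv("data.csv")
df_cols = [col for col in df.columns if re.search(pattern, col)]
df = df[df_cols]

# Alternative: filter while reading with a callable usecols.
df_alt = pd.read_csv("data.csv", usecols=lambda c: re.search(pattern, c) is not None)
```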
