Create a new dataset after selection in Python

Total newbie with Python here, and I'm trying to learn "in the field".
So basically I managed to open a csv file, pick only the rows that have certain values in specific columns, and then print the rows.
What I'd love to do after this is basically get a random selection of one of the found rows.
I thought to do that by creating a new csv file first, which at this point will only contain the filtered rows, and then randomly select from it.
Any ideas on the simplest way to do that?
Here's the portion of the code so far:
import csv
with open("top2018.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        # note: csv.reader yields strings, so these are string comparisons
        if (row[4] >= "0.8") and (row[6] <= "-4") and (row[12] >= "0.8"):
            print(row[2] + " -", row[1])
It will find 2 rows (I checked).
And then, for creating a new csv file:
import pandas as pd
artist = [row[2]]
name = [row[1]]
dict = {'artist': artist, 'name': name}
df = pd.DataFrame(dict)
df.to_csv('test.csv')
But I don't know why, with this method, the new csv file has only 1 entry, while I'd want all of the found rows in it.
Hope something I wrote makes sense!
Thanks guys!

You are mixing up columns and rows; maybe you should rename the variable row to record so you can see better what is happening. Unfortunately, I have to guess at what the data file looks like...
The dict variable (try not to use this name; dict is actually a built-in type and you don't want to shadow it) is creating two columns, "artist" and "name", which seem to have values like [1.2]. So dict (try to print it) could look like {"artist": [2.0], "name": [3.1]}, which is a single-row, two-column entity:
artist  name
   2.0   3.1
Try to get into pandas: use df = pd.read_csv() and the df[df.something > 0.3] style of notation to filter tables. The csv package is better suited for truly tricky data wrangling.
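To make that concrete, here is a minimal sketch of the pandas approach, including the random pick you asked about. The column names (energy, loudness, valence, artists, name) are my guesses at what top2018.csv contains, so adjust them to your file:
import pandas as pd
df = pd.read_csv("top2018.csv")
# numeric comparisons instead of string comparisons,
# since pandas parses the numbers for you
# (column names here are assumptions -- check them against your file)
hits = df[(df["energy"] >= 0.8) & (df["loudness"] <= -4) & (df["valence"] >= 0.8)]
# write all matching rows to a new csv, then pick one at random
hits[["artists", "name"]].to_csv("test.csv", index=False)
print(hits.sample(n=1))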

Related

Python for Excel: While value in column A is the same, take remaining row data and copy paste to new worksheet

I am somewhat of a beginner to Python and have encountered the following problem working with openpyxl. For example, I have the sample worksheet below:
What I am trying to do is loop through the Boat ID column and, while the cell values are equal, take the respective row data to the right, open a new worksheet/workbook, and copy the rows in columns B:E.
So in theory, for every Boat ID = 1 we would take every row unique to ID 1 from columns B:E, open a new workbook, and paste them accordingly. Next, for every Boat ID = 2 we would take rows 5-8 in columns B:E, open a new workbook, and paste accordingly. Similarly, we would repeat the process for every Boat ID = 3.
P.S. To keep it simple I have ordered the table by Boat ID in ascending order, but if someone wants bonus points they could opine on how it would be done if the table was not ordered.
I haven't worked with openpyxl, but this is relatively simple with pandas, I would definitely check it out.
Loading the sheet is as easy as:
import pandas as pd
df = pd.read_excel("path_to_file")
You can then filter the table by doing something like:
# assumes Boat IDs are the integers 1..max
for i in range(1, df['Boat ID'].max() + 1):
    temp_df = df.loc[df['Boat ID'] == i]
    temp_df.to_excel(f"path_to_save_location/sheet_{i}.xlsx")
This doesn't handle duplicate rows with the same Boat ID, but it should get you most of the way!
I would also highly recommend the Jupyter notebook for prototyping stuff in pandas.
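For the bonus points in the P.S.: groupby does not care whether the table is sorted, so a sketch like the following would also handle an unordered table (the column names are placeholders for your actual sheet):
import pandas as pd
df = pd.read_excel("path_to_file")
# groupby collects every row belonging to a Boat ID, regardless of row order
for boat_id, group in df.groupby("Boat ID"):
    # drop the ID column so only the data columns (B:E) are written
    group.drop(columns=["Boat ID"]).to_excel(f"boat_{boat_id}.xlsx", index=False)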

Converting for loop to numpy calculation for pandas dataframes

So I have a Python script that compares two dataframes and finds any rows that are not in both. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that the iteration is the problem. However, I haven't had much luck using various pandas/numpy methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
import pandas as pd
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if df_new[new_col_name][i] not in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create an xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs in differently named columns of two different dataframes, try an approach like this.
First, find the IDs that exist in the second dataframe:
test_ids = df2['cola_id'].unique().tolist()
Then filter the first dataframe down to those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter:
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
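To see what that one-liner does on the sample data above, here is a minimal sketch (columns abridged):
import pandas as pd
df_current = pd.DataFrame({"partno": [123, 456],
                           "description": ["Logo T-Shirt", "Knitted Shirt"]})
df_new = pd.DataFrame({"mfgr_num": [456, 789],
                       "desc": ["Knitted Shirt", "Logo Vest"]})
current_col_name, new_col_name = "partno", "mfgr_num"
# keep only the rows of df_new whose ID never appears in df_current
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)  # only the 789 "Logo Vest" row remains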

Applying python dictionary to column of a CSV file

I have a CSV file that includes one column of data that is not user friendly. I need to translate that data into something that makes sense. A simple find/replace seems bulky, since there are dozens if not hundreds of different possible combinations I want to translate.
For instance: BLK = Black or MNT TP = Mountain Top
There are dozens if not hundreds of possible translations - I already have lots of them in a CSV table. The problem is how to use that dictionary to change the values in another CSV table. It is also important to note that this will (eventually) need to run on its own every few minutes - not just as a one-time translation.
It would be nice if you could describe in more detail what the data you're working with looks like. I'll take my best guess, though.
Let's say you have a CSV file; you use pandas to read it into a data frame named df, and the "not user friendly" column is named col.
To replace all the values in column col, you first need a dictionary containing all the keys (original texts) and values (new texts):
my_dict = {"BLK": "Black", "MNT TP": Mountain Top,...}
Then, map the dictionary to the column:
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
If a key appears in the dictionary, it will be replaced by the new corresponding value in the dictionary, otherwise, it keeps the original value.
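Since you already have the translations in a CSV table, you can build the dictionary from that file instead of typing it out. A sketch, assuming the translation file has two columns named code and meaning (substitute your actual file and column names):
import pandas as pd
translations = pd.read_csv("translations.csv")  # hypothetical file/column names
my_dict = dict(zip(translations["code"], translations["meaning"]))
df = pd.read_csv("data.csv")
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
df.to_csv("data_translated.csv", index=False)
Wrapped in a script like this, the translation can then be scheduled (e.g. with cron or Task Scheduler) to run every few minutes.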

Using python pandas how can we select very specific rows and associated column

I am still learning Python; kindly excuse me if the question looks trivial to some.
I have a csv file with the following format, and I want to extract a small segment of it and write it to another csv file:
So, this is what I want to do:
Just extract the entries under actor_list2 and the corresponding id column, and write them to a csv file in the following format.
Since the format is not regular column headers followed by values, I am not sure how to select a starting point based on a cell value in a particular column. E.g., even if we consider actor_list2, it may have any number of entries under it. Please help me understand whether this can be done using pandas' dataframe processing capability.
Update: The reason why I would like to automate this is that there can be thousands of such files, and it would be impractical to manually extract that info to create the final csv file, which will essentially have a row for each file.
As Nour-Allah has pointed out, the formatting here is not very regular, to say the least. If it is the case that your data comes out like this every time, the best you can do is skip some rows of the file:
import pandas as pd
# skip the first 17 preamble rows, then read only the 8 rows of interest
df = pd.read_csv('blabla.csv', skiprows=17, nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result, but given how erratic the formatting is, there is no way to automate this. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
Since the CSV file you have is not regular, there are a lot of empty positions, which read in as 'nan' objects. Also, the columns will be indexed by number.
I will use pandas to read it:
import pandas as pd
df = pd.read_csv("not_regular_format.csv", header=None)
Then, initialize an empty dictionary to store the results in; it will be used to build an output DataFrame, whose contents are finally written to a CSV file:
target = {}
Now you need to find actor_list2 in the second column (the column with index 1), and if it exists, start storing the names and scores from the next rows (columns 1 and 2) in the dictionary target:
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
    i = rows_index[0]
    while True:
        i += 1
        name = df.iloc[i, 1]
        score = df.iloc[i, 2]
        if pd.isna(name):  # the names sequence is finished once a 'nan' appears
            break
        target[name] = [score]
Finally, construct the DataFrame and write the new output.csv file:
df_output = pd.DataFrame(target)
df_output.to_csv('output.csv')
Now you can go anywhere from the given example above.
Good luck!

Is this possible in python with pandas or xlwt

I have an existing Excel file that looks like this:
and I have another Excel file that has around 40,000 rows and around 300 columns; a shortened version looks like this:
I would like to append values to my existing Excel file from the second one, but only the values that match the values in col4 of my existing file. So I would get something like this:
Hope you guys get the picture of what I am trying to do.
Yes, that is possible in pandas, and it is way faster than anything in Excel:
df_result = pd.merge(FirstTable, SecondTable, how='left', on='col4')
This will look for the column "col4" in both tables, so it needs to be named that way in both.
Also be aware that if the second table has multiple rows for a single value in the first table, the result will have as many lines as there are matches in the second table.
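A tiny made-up illustration of that duplication behavior:
import pandas as pd
first = pd.DataFrame({'col4': ['A'], 'x': [1]})
second = pd.DataFrame({'col4': ['A', 'A'], 'y': [10, 20]})
# 'A' matches two rows in second, so the single row of first
# appears twice in the merged result
print(pd.merge(first, second, how='left', on='col4'))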
To read the Excel files you can use:
import pandas as pd
xl = pd.ExcelFile('MyFile.xlsx')
FirstTable = pd.read_excel(xl, 'sheet_name_FIRST_TABLE')
For a more detailed description, see the documentation.
