Total newbie with Python here, and I'm trying to learn "on the job".
So basically I managed to open a csv file, pick only the rows that have certain values in specific columns, and then print the rows.
What I'd love to do next is get a random selection of one of the found rows.
I thought I'd do that by first creating a new csv file that contains only the filtered rows, and then randomly selecting from it.
Any ideas on the simplest way to do that?
Here's the portion of the code so far:
import csv

with open("top2018.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        if (row[4] >= "0.8") and (row[6] <= "-4") and (row[12] >= "0.8"):
            print(row[2] + " -", row[1])
It will find 2 rows (I checked).
And then, for creating a new csv file:
import pandas as pd
artist = [row[2]]
name = [row[1]]
dict = {'artist': artist, 'name': name}
df = pd.DataFrame(dict)
df.to_csv('test.csv')
But I don't know why, with this method, the new csv file has only 1 entry, while I'd want all of the found rows in it.
Hope something I wrote makes sense!
Thanks guys!
You are mixing up columns and rows; maybe you should rename the variable row to record so you can see better what is happening. Unfortunately, I have to guess what the data file might look like...
The dict variable (try not to use this name; dict is actually a built-in type and you don't want to overwrite it) is creating two columns, "artist" and "name", which seem to have values like [1.2]. So dict (try printing it) could look like {"artist": [2.0], "name": [3.1]}, which is a single-row, two-column entity:
artist name
2.0 3.1
Try to get into pandas: use df = pd.read_csv() and df[df.something > 0.3]-style notation to filter tables. The csv package is better suited for truly tricky data wrangling.
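For instance, here is a minimal sketch of that approach applied to your original question. The column names ("danceability", "loudness", "valence", "artists", "name") are assumptions based on your numeric thresholds, so adjust them to whatever top2018.csv actually contains:

import pandas as pd

df = pd.read_csv("top2018.csv")

# boolean filtering on named columns instead of string comparisons
filtered = df[(df["danceability"] >= 0.8)
              & (df["loudness"] <= -4)
              & (df["valence"] >= 0.8)]

filtered.to_csv("test.csv", index=False)  # writes ALL matching rows
pick = filtered.sample(n=1)               # one randomly chosen row
print(pick[["artists", "name"]])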
I am still learning Python, so kindly excuse me if the question looks trivial to some.
I have a csv file with the following format, and I want to extract a small segment of it and write it to another csv file:
So, this is what I want to do:
Just extract the entries under actor_list2 and the corresponding id column, and write them to a csv file in the following format.
Since the format is not a regular set of column headers followed by values, I am not sure how to select a starting point based on a cell value in a particular column. E.g., even if we consider actor_list2, it may have any number of entries under it. Please help me understand whether this can be done using pandas DataFrame processing capabilities.
Update: The reason I would like to automate this is that there can be thousands of such files, and it would be impractical to manually pull that info together to create the final csv file, which will essentially have one row per file.
As Nour-Allah has pointed out, the formatting here is not very regular, to say the least. If your data comes out like this every time, the best you can do is skip some rows of the file:
import pandas as pd
df = pd.read_csv('blabla.csv', skiprows=list(range(17)), nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result, but given how erratic the formatting is, there is no way to automate this. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
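That said, since the goal was to write the extracted segment to another csv file, one more line finishes the job (the output filename is just a placeholder):

df_res.to_csv('actor_list2_extract.csv', index=False)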
Since the CSV file you have is not regular, it contains a lot of empty positions, which pandas reads as NaN values. Also, the columns will be indexed by number rather than by name.
I will use pandas to read it:
import pandas as pd
df = pd.read_csv("not_regular_format.csv", header=None)
Then initialize an empty dictionary to store the results in; we will use it to build an output DataFrame and finally send its contents to a CSV file:

target = {}
Now you need to find actor_list2 in the second column (the one with index 1); if it exists, start storing the names and scores from the following rows (columns 1 and 2) in the dictionary target:
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
    i = rows_index[0]
    while True:
        i += 1
        name = df.iloc[i, 1]
        score = df.iloc[i, 2]
        if pd.isna(name):  # the names sequence is finished once a NaN appears
            break
        target[name] = [score]
Finally, construct a DataFrame and write the new output.csv file:
df_output = pd.DataFrame(target)
df_output.to_csv('output.csv')
You should be able to take it from here with the example above.
Good luck!
I have a csv file with logs.
I need to analyze it and select the necessary information from the file.
The problem is that it contains a lot of tables with headers, and the tables don't have names.
The tables are separated from each other by empty rows.
Let's say I need to select all data from the %idle column, where CPU = all
Structure:
09:20:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
09:21:06,all,4.98,0.00,5.10,0.00,0.00,0.00,0.06,0.00,89.86
09:21:06,0,12.88,0.00,5.62,0.03,0.00,0.02,1.27,0.00,80.18

12:08:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
12:09:06,all,5.48,0.00,5.24,0.00,0.00,0.00,0.12,0.00,89.15
12:09:06,0,18.57,0.00,5.35,0.02,0.00,0.00,3.00,0.00,73.06

09:20:06,runq-sz,plist-sz,ldavg-1,ldavg-5,ldavg-15
09:21:06,3,1444,2.01,2.12,2.15
09:22:06,4,1444,2.15,2.14,2.15
You can use the program below to parse this csv.
result = {}
with open("log.csv", "r") as f:
    for table in f.read().split("\n\n"):
        rows = table.split("\n")
        header = rows[0]
        for row in rows[1:]:
            for i, j in zip(header.split(",")[1:], row.split(",")[1:]):
                if i in result:
                    result[i].append(j)
                else:
                    result[i] = [j]
print(result["%idle"])
Output (values of %idle)
['89.86', '80.18', '89.15', '73.06']
This assumes the column and row values are in the same order in every table and that no two tables have a common column name.
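Note that this collects %idle for every row, including the per-core ones. If you only want the rows where CPU = all, as the question asks, a small variant of the same loop can skip the rest (a sketch against the same log.csv):

result = {}
with open("log.csv", "r") as f:
    for table in f.read().split("\n\n"):
        rows = table.split("\n")
        header = rows[0].split(",")
        for row in rows[1:]:
            values = row.split(",")
            # keep only rows whose CPU field is 'all'; skip per-core rows
            if "CPU" in header and values[header.index("CPU")] != "all":
                continue
            for i, j in zip(header[1:], values[1:]):
                result.setdefault(i, []).append(j)
print(result["%idle"])  # ['89.86', '89.15']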
One rather dumb solution would be to use an "ordinary" file reader for the original CSV: read everything up to a blank line as a single CSV, then parse the text you just read in memory.
Every time you "see" a blank line, you know to treat what follows as an entirely new CSV, so you can repeat the above procedure for it.
For example, you would have one string that contained:
09:20:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
09:21:06,all,4.98,0.00,5.10,0.00,0.00,0.00,0.06,0.00,89.86
09:21:06,0,12.88,0.00,5.62,0.03,0.00,0.02,1.27,0.00,80.18
and then parse it in memory. Once you get to the blank line after that, you would know that you needed a new string containing the following:
12:08:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
12:09:06,all,5.48,0.00,5.24,0.00,0.00,0.00,0.12,0.00,89.15
12:09:06,0,18.57,0.00,5.35,0.02,0.00,0.00,3.00,0.00,73.06
etc. - you can just keep going like this for as many tables as you have.
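Here is a rough sketch of that idea, assuming the log lives in log.csv: split the file on blank lines, then hand each block to csv.reader through an in-memory buffer.

import csv
import io

with open("log.csv") as f:
    blocks = [b for b in f.read().split("\n\n") if b.strip()]

tables = []
for block in blocks:
    rows = list(csv.reader(io.StringIO(block)))  # parse one table in memory
    tables.append({"header": rows[0], "rows": rows[1:]})

# each entry in tables can now be inspected or filtered on its own header
for t in tables:
    print(t["header"])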
So I am trying to copy all the data from an Excel file with a dynamic row length (it can range from 100 to 500 rows), and I then want to copy the contents of each cell, iterating by column and updating rows down to the last row.
Right now my code updates by row when I specify the column ID. I am storing a primary column and a list of non-primary columns, but I am not sure how to iterate an update through the cells of each column, row first, so that if I lose my internet connection for any reason I know where it last got updated.
Yes, this is a slow process.
The second part is that I can open an Excel file with openpyxl, read a cell value, and store it in a variable, but I am struggling to pass it to the Smartsheet code:
MySheet = smartsheet.Sheets.get_sheet(SHEET_ID, PrimaryCol)
for MyRow in MySheet.rows:
    for MyCell in MyRow.cells:
        print(MyRow.id, MyCell.value)
        row_a = smartsheet.Sheets.get_row(SHEET_ID, MyRow.id)
        cell_a = row_a.get_column(PrimaryCol)
        cell_a.value = 'new value'
        row_a.set_column(cell_a.column_id, cell_a)
        smartsheet.Sheets.update_rows(SHEET_ID, [row_a])
Any help would be great, thanks!
I think these links (Add Rows, Update Rows) will be helpful in achieving the functionality you're looking for.
Ultimately, when ripping through an Excel or CSV file, you're going to want to generate the entire row update (and the updates for all of the rows) before submitting the update call to Smartsheet.
It appears in your code that you're making an update call for each cell in your sheet. So at a high level, you might try first getting all of the column IDs for your sheet, and then, for each row in your Excel file, generating an update/add call for that row.
Your last step should be a single call to the sheet that contains all of the row updates you're looking for. That call should look something like:
smartsheet.Sheets.update_rows(SHEET_ID, ROW_UPDATES)
Where ROW_UPDATES is a list of all the row objects you're adding/updating.
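A rough sketch of that pattern with the Smartsheet Python SDK, where SHEET_ID and PRIMARY_COL_ID are placeholders and the single-cell change is just for illustration:

import smartsheet

ss = smartsheet.Smartsheet('YOUR_ACCESS_TOKEN')
sheet = ss.Sheets.get_sheet(SHEET_ID)

row_updates = []
for existing_row in sheet.rows:
    new_cell = smartsheet.models.Cell()
    new_cell.column_id = PRIMARY_COL_ID   # placeholder: the column to modify
    new_cell.value = 'new value'

    updated_row = smartsheet.models.Row()
    updated_row.id = existing_row.id      # identify the row being updated
    updated_row.cells.append(new_cell)
    row_updates.append(updated_row)

# one call for the whole batch instead of one call per cell
ss.Sheets.update_rows(SHEET_ID, row_updates)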
I tried using it like this:
sheet.get_highest_row()
but the output was 104578
I have only 3 rows in my Excel sheet; I need to get only the number of rows actually present in the sheet.
Open the file in Excel and press Ctrl-End. This will select the last cell according to Excel, which is probably an empty cell somewhere in row 104578. If that is the case, select all the empty rows in Excel, delete them, and save. This should update the last row.
To get the number of rows, you simply get the property max_row from the worksheet object:
rows = sheet.max_row
print(rows)
In your case, it prints 3.
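For completeness, a self-contained version (the filename book.xlsx is just a placeholder):

from openpyxl import load_workbook

wb = load_workbook("book.xlsx")
sheet = wb.active          # or wb["SheetName"]
print(sheet.max_row)       # the number of rows openpyxl considers used

Note that max_row, like the deprecated get_highest_row(), is based on the sheet's recorded dimensions, so formerly used or formatted cells can still inflate it; the Ctrl-End cleanup described above addresses exactly that.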