Analyze logs with Python

I have a CSV file with logs.
I need to analyze it and extract the necessary information from it.
The problem is that it contains many tables with headers, and the tables have no names.
The tables are separated from each other by empty rows.
Let's say I need to select all data from the %idle column where CPU = all.
Structure:
09:20:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
09:21:06,all,4.98,0.00,5.10,0.00,0.00,0.00,0.06,0.00,89.86
09:21:06,0,12.88,0.00,5.62,0.03,0.00,0.02,1.27,0.00,80.18
12:08:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
12:09:06,all,5.48,0.00,5.24,0.00,0.00,0.00,0.12,0.00,89.15
12:09:06,0,18.57,0.00,5.35,0.02,0.00,0.00,3.00,0.00,73.06
09:20:06,runq-sz,plist-sz,ldavg-1,ldavg-5,ldavg-15
09:21:06,3,1444,2.01,2.12,2.15
09:22:06,4,1444,2.15,2.14,2.15

You can use the program below to parse this CSV:
result = {}
with open("log.csv", "r") as f:
    # Tables are separated by blank lines, so split the file on "\n\n".
    for table in f.read().split("\n\n"):
        rows = table.split("\n")
        header = rows[0]
        # Pair each header field with the matching field of every data row,
        # skipping the first (timestamp) column.
        for row in rows[1:]:
            for i, j in zip(header.split(",")[1:], row.split(",")[1:]):
                if i in result:
                    result[i].append(j)
                else:
                    result[i] = [j]
print(result["%idle"])
Output (values of %idle):
['89.86', '80.18', '89.15', '73.06']
This assumes that, within each table, the header and data rows list their fields in the same order, and that no two tables share a column name.
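If you only want the rows where CPU is all (as the question asks), a small variation can filter while parsing. This is a sketch under the same assumptions; the CPU and %idle column names come from the sample data:
# Sketch: same parsing idea, but keep only rows whose CPU field is "all".
idle_all = []
with open("log.csv") as f:
    for table in f.read().split("\n\n"):
        rows = table.split("\n")
        header = rows[0].split(",")
        if "%idle" not in header:
            continue  # skip tables without a %idle column (e.g. load averages)
        cpu_col = header.index("CPU")
        idle_col = header.index("%idle")
        for row in rows[1:]:
            fields = row.split(",")
            if len(fields) <= idle_col:
                continue  # skip blank or short trailing lines
            if fields[cpu_col] == "all":
                idle_all.append(fields[idle_col])
print(idle_all)  # ['89.86', '89.15']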

One rather simple solution would be to use an ordinary file reader on the original CSV. You can read everything up to a blank line as a single CSV and then parse the text you just read in memory.
Every time you see a blank line, you know to treat what follows as an entirely new CSV, so you can repeat the above procedure for it.
For example, you would have one string that contained:
09:20:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
09:21:06,all,4.98,0.00,5.10,0.00,0.00,0.00,0.06,0.00,89.86
09:21:06,0,12.88,0.00,5.62,0.03,0.00,0.02,1.27,0.00,80.18
and then parse it in memory. Once you reach the blank line after that, you would know you need a new string containing the following:
12:08:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
12:09:06,all,5.48,0.00,5.24,0.00,0.00,0.00,0.12,0.00,89.15
12:09:06,0,18.57,0.00,5.35,0.02,0.00,0.00,3.00,0.00,73.06
etc. - you can just keep going like this for as many tables as you have.
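A minimal sketch of that approach, assuming the same log.csv as above (the iter_tables helper name is made up for illustration):
import csv
import io

def iter_tables(path):
    """Yield each blank-line-separated table as a list of parsed CSV rows."""
    with open(path) as f:
        chunk = []
        for line in f:
            if line.strip():
                chunk.append(line)      # still inside the current table
            elif chunk:
                # Blank line: the current table is complete, parse it.
                yield list(csv.reader(io.StringIO("".join(chunk))))
                chunk = []
        if chunk:                       # last table has no trailing blank line
            yield list(csv.reader(io.StringIO("".join(chunk))))

idle = []
for table in iter_tables("log.csv"):
    header = table[0]
    if "%idle" in header:
        col = header.index("%idle")
        idle.extend(row[col] for row in table[1:])
print(idle)  # ['89.86', '80.18', '89.15', '73.06']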

Related

Trying to write/read CSV file with None objects for empty cells [Python]

I'm trying to read CSV data into a pandas DataFrame so that empty cells are recognized as None values.
The delimiter is ',' and I have two in a row wherever I need a None value. For example, the row:
12345,'abc','abc',,,12,'abc'
should be converted to the tuple:
(12345, 'abc', 'abc', None, None, 12, 'abc')
I need this in order to insert the data into MySQL later; I'm using the cursor.execute() function with the query and the data.
I have tried to load the CSV file into a DataFrame and replace the NaN values, but it is not supported:
chunk = chunk.replace(np.nan, None, regex=True)
Any suggestions?
Sorry, I did not fully grasp the question, but if it is about CSV, why not use arbitrary sentinel values of your choice, or even empty strings, which you can then convert later in the program when you write the data out or read it back?
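As a minimal sketch of that idea (the file name data.csv is an assumption), you could swap NaNs for None just before building the tuples for cursor.execute():
import pandas as pd

df = pd.read_csv("data.csv", header=None)
rows = [
    # Replace every NaN with None while turning each record into a tuple.
    tuple(None if pd.isna(v) else v for v in record)
    for record in df.itertuples(index=False, name=None)
]
# rows now contains None in place of the empty cells, ready for executemany().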

Pandas Read csv just read a line of a row

I have a classic pandas DataFrame made of ID and Text columns. I would like to get just one column, so I use the typical df["columnname"]. But at that point it becomes a pandas Series. Is there a way to make a new DataFrame with just that single column?
I'm asking because if I cast the Series to string (columnname = columnname.astype("string")) and save it to a text file, I see that it only saves the first sentence of each line and not the entire text content, as I would like.
If there are any other solutions, I'm open to learning. :)
Try this: pd.DataFrame(dfname["columnname"])
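For what it's worth, a short sketch with made-up data shows two equivalent ways to get a one-column DataFrame instead of a Series:
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "Text": ["first sentence. second.", "more text"]})

as_frame = pd.DataFrame(df["Text"])  # wrap the Series in a DataFrame
same = df[["Text"]]                  # double brackets also select a DataFrame

# Writing through the DataFrame keeps each cell's full text.
as_frame.to_csv("text_only.csv", index=False)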

Applying python dictionary to column of a CSV file

I have a CSV file that includes one column of data that is not user friendly. I need to translate that data into something that makes sense. A simple find/replace seems unwieldy, since there are dozens, if not hundreds, of different possible combinations I want to translate.
For instance: BLK = Black, or MNT TP = Mountain Top.
There are dozens if not hundreds of possible translations, and I already have lots of them in a CSV table. The problem is how to use that dictionary to change the values in another CSV table. It is also important to note that this will (eventually) need to run on its own every few minutes, not just as a one-time translation.
It would be nice if you could describe in more detail the data you're working with; I'll make my best guess, though.
Let's say you have a CSV file, you use pandas to read it into a DataFrame named df, and the "not user friendly" column is named col.
To replace all the values in column col, you first need a dictionary containing all the keys (original texts) and values (new texts):
my_dict = {"BLK": "Black", "MNT TP": "Mountain Top"}  # ...and the rest
Then, map the dictionary to the column:
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
If a key appears in the dictionary, it is replaced by the corresponding new value from the dictionary; otherwise, the original value is kept.
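Since the question mentions the translations already live in a CSV table, a sketch that builds my_dict from such a file might look like this (the file names translations.csv and data.csv, and the code/label column names, are assumptions):
import pandas as pd

# Build the lookup from the translation table: one column of raw codes,
# one column of friendly labels.
trans = pd.read_csv("translations.csv")
my_dict = dict(zip(trans["code"], trans["label"]))

df = pd.read_csv("data.csv")
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
df.to_csv("data_translated.csv", index=False)
Wrapped in a function and scheduled (with cron, for instance), this could run on its own every few minutes, as the question requires.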

Python, Pandas: How to automatically skip Excel header cells and add the rest to a dataframe

Greetings! I would like to transform an Excel document into a DataFrame, but unfortunately the Excel documents are made by someone else, and they will always have headers like so:
Excel example
I would like to ignore the "made by stevens" and "made 04/02/21" parts and just read the relevant information, like name, age, and file.
How would I skip them using pandas?
Is there a way to always skip that header information, even if the relevant info (name, age, file) starts on a different line in different documents (i.e., in one document age is in row 4 and in another age is in row 7)?
Thanks!
The function pandas.read_excel has a parameter called skiprows; if you feed it an integer, it will simply skip the first n lines of the file.
In your case, just use:
df = pd.read_excel(filepath, skiprows=4)
The second part of your question is trickier, and the best solution depends on your business use case. If the columns are always the same (Name, Age, File), you could import the Excel file without skipping lines, fix the column names yourself, and then drop the rows with empty data along with the extra header rows you don't need.
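A minimal sketch of that idea, assuming the real header row always starts with "Name" and a file named report.xlsx (both assumptions, not from the question):
import pandas as pd

# Read everything with no header, so the junk rows come in as data.
raw = pd.read_excel("report.xlsx", header=None)

# Find the row whose first cell is "Name" (assumed to mark the real header).
# With header=None the default RangeIndex means label == position.
header_row = raw.index[raw[0] == "Name"][0]

# Keep only the rows after the header, name the columns, drop empty rows.
df = raw.iloc[header_row + 1:].reset_index(drop=True)
df.columns = ["Name", "Age", "File"]
df = df.dropna(how="all")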
If you want to skip a header that is on row 1, then you can try this:
pd.read_excel(filepath, skiprows=1, skipfooter=0)
You can pass an integer to skiprows to skip rows at the top and to skipfooter to skip rows at the bottom; the numbers depend on how many rows you want to skip.

Using python pandas how can we select very specific rows and associated column

I am still learning python, kindly excuse if the question looks trivial to some.
I have a CSV file with the following format, and I want to extract a small segment of it and write it to another CSV file:
So, this is what I want to do:
just extract the entries under actor_list2 and the corresponding id column, and write them to a CSV file in the following format.
Since the format is not a regular layout of column headers followed by values, I am not sure how to select the starting point based on a cell value in a particular column. E.g., even if we consider actor_list2, it may have any number of entries under it. Please help me understand whether this can be done using pandas' DataFrame processing capability.
Update: The reason why I would like to automate it is that there can be thousands of such files, and it would be impractical to manually gather that info to create the final CSV file, which will essentially have one row per file.
As Nour-Allah has pointed out, the formatting here is not very regular, to say the least. If your data really comes out like this every time, the best you can do is skip some rows of the file:
import pandas as pd

df = pd.read_csv('blabla.csv', skiprows=list(range(17)), nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result, but given how erratic the formatting is, there is no way to automate this. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
As the CSV file you have is not regular, there are a lot of empty positions, which contain NaN objects. Meanwhile, the columns will be indexed by number.
I will use pandas to read it:
import pandas as pd

df = pd.read_csv("not_regular_format.csv", header=None)
Then, initialize an empty dictionary to store the results in; it will be used to build an output DataFrame, whose contents are finally sent to a CSV file:
target = {}
Now you need to find actor_list2 in the second column, which is the column with index 1, and if it exists, start storing the names and scores from the following rows (columns 1 and 2) in the dictionary target:
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
    i = rows_index[0]
    while True:
        i += 1
        name = df.iloc[i, 1]
        score = df.iloc[i, 2]
        if pd.isna(name):  # the names sequence is finished once a NaN appears
            break
        target[name] = [score]
Finally, construct the DataFrame and write the new output.csv file:
df_output = pd.DataFrame(target)
df_output.to_csv('output.csv')
Now you can adapt the example above to wherever you need it.
Good luck!
