I have Excel files in the following format:
Sensor 1 meta
Sensor 2 meta
"Summary of Observation"
Sensor 1
Sensor 2
The number of rows before and after "Summary of Observation" is not fixed (i.e. one file may have only sensors 1 and 2, while another may have 1, 2, 3, ...).
In the dataframe, I only want the information after "Summary of Observation".
Right now, I open the Excel file, note the row from which I want the information, and pass it in:
df = pd.read_excel("1.xlsx", skiprows=%put some value here%)
Is there a way to automate this? I don't want to open Excel manually; rather, I'd like to import only the relevant rows (or delete the others after importing).
After importing the file, you can find the index of the marker row and select the data from that point.
import pandas as pd

df = pd.read_excel("1.xlsx")
# I used the column name `text`; replace it with yours
idx = df[df['text'] == 'Summary of Observation'].index[0]
df = df.iloc[idx + 1:]
print(df)
Output:
text
3 Sensor 1
4 Sensor 2
Or, if you want to include the "Summary of Observation" row itself, just use idx in place of idx + 1.
You can also load the Excel file without headers and use df.loc[df[0] == "Summary of Observation"].index[0] to get the index.
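For instance, a minimal two-pass sketch, assuming the marker text sits in the first column:

import pandas as pd

# first pass: read without headers so the marker row can be located by position
raw = pd.read_excel("1.xlsx", header=None)
marker_row = raw.loc[raw[0] == "Summary of Observation"].index[0]

# second pass: skip everything up to and including the marker row
df = pd.read_excel("1.xlsx", header=None, skiprows=marker_row + 1)
print(df)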
Working code at https://github.com/gklc811/Python3.6/blob/master/stackoverflowsamples/excel.ipynb
I am trying to write a script using python-docx and pandas in Python 3 which performs the following actions:
Take input from a csv file
Merge common values of Column C and add each value into the docx
Export the docx
My raw csv is as below:
SN. Name Instance Severity
1 Name1 file1:line1 CRITICAL
2 Name1 file2:line3 CRITICAL
3 Name2 file1:line1 Low
4 Name2 file1:line3 Low
and so on...
and I want my docx output as:
[desired docx output shown in screenshot: https://i.stack.imgur.com/1xNc0.png]
I am not able to figure out how I can filter "Instance" values based on "Name" using pandas and later print them into the docx.
Thanks in advance.
The code below selects the relevant columns, groups by 'Name' and 'Severity', and joins the Instances together:
df2 = df[["Name", "Instance", "Severity"]].copy()  # .copy() avoids SettingWithCopyWarning
df2["Instance"] = df2.groupby(['Name', 'Severity'])['Instance'].transform(lambda x: '\n'.join(x))
Finally, remove the duplicates and transpose to get the desired output:
df2 = df2.drop_duplicates()
df2 = df2.T
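To then get this into a Word document, here is a minimal python-docx sketch (the three-column table layout is an assumption, since the desired output is only shown as a screenshot; it uses df2 as it stands after drop_duplicates(), before the final transpose):

from docx import Document

doc = Document()
table = doc.add_table(rows=1, cols=3)
hdr = table.rows[0].cells
hdr[0].text, hdr[1].text, hdr[2].text = "Name", "Instance", "Severity"

# one table row per unique (Name, Severity) group
for _, row in df2.iterrows():
    cells = table.add_row().cells
    cells[0].text = str(row["Name"])
    cells[1].text = str(row["Instance"])  # newline-joined instances
    cells[2].text = str(row["Severity"])

doc.save("output.docx")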
How can I join 2 text files using Python and output a third file, adding only values from one file that have a corresponding matching value in the second file?
Input File1.txt:
GameScore|KeyNumber|Order
85|2568909|2|
84|2672828|1|
80|2689999|5|
65|2123232|3|
Input File2.txt:
KeyName|RecordNumber
Andy|2568909|
John|2672828|
Andy|2672828|
Megan|1000021|
Required Output File3.txt:
KeyName|KeyNumber|GameScore|Order
Andy|2672828|84|1|
Andy|2568909|85|2|
John|2672828|84|1|
Megan|1000021||
Look for a key name and record number in File 2, match the record number with KeyNumber in File 1, and copy over the corresponding game score and order values.
The files have anywhere from 1 to 500,000 records, so this needs to work on a large set.
Edit: I do not have access to any libraries like pandas and am not allowed to install any.
Essentially I need to run a command that triggers a program that reads the 2 files, compares them, and generates the third file.
You can use pandas to do this:
import pandas as pd

# index_col=False stops pandas treating the extra empty field from each
# line's trailing '|' as an index column
df1 = pd.read_csv('Input File1.txt', sep='|', index_col=False)
df2 = pd.read_csv('Input File2.txt', sep='|', header=0, names=['KeyName', 'KeyNumber'], index_col=False)
# how='right' keeps rows like Megan's that have no match in File1
df3 = df1.merge(df2, on='KeyNumber', how='right')
df3.to_csv('File3.txt', sep='|', index=False)
See the documentation for fine-tuning.
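Since the edit says pandas is not available, here is a stdlib-only sketch using the csv module (it assumes the file names above and the trailing '|' on each data line):

import csv

# build a lookup from KeyNumber -> (GameScore, Order) out of File1
scores = {}
with open('Input File1.txt', newline='') as f1:
    reader = csv.reader(f1, delimiter='|')
    next(reader)  # skip the header row
    for row in reader:
        if row:  # a row looks like ['85', '2568909', '2', '']
            scores[row[1]] = (row[0], row[2])

with open('Input File2.txt', newline='') as f2, \
     open('File3.txt', 'w', newline='') as f3:
    reader = csv.reader(f2, delimiter='|')
    next(reader)  # skip the header row
    f3.write('KeyName|KeyNumber|GameScore|Order\n')
    for row in reader:
        if not row:
            continue
        score, order = scores.get(row[1], ('', ''))
        f3.write('%s|%s|%s|%s|\n' % (row[0], row[1], score, order))

A dict lookup keeps this to a single pass over each file, which is fine for 500,000 records.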
I am reading in a large CSV file using Python Pandas that is a little over 2GB in size. Ultimately, what I am attempting to do is add a "Date" column at index 0 as the first column, then transpose the file from 364 rows and approximately 360,000 columns to only three columns ("Date", "Location", and "Data") with many, many rows instead. This will then be written out to a newly transposed CSV file.
For a little more context, each of the 364 rows represents a day of the year. For each day (and each row), there are thousands and thousands of site locations (these are the columns), each containing a measurement taken at that location.
The file looks like this right now:
Index Location #1 Location #2 Location #359000...
0 Measurement Measurement Measurement
1 Measurement Measurement Measurement
2 Measurement Measurement Measurement
3 Measurement Measurement Measurement
364... Measurement Measurement Measurement
I have attempted to add the new column by creating a date column using the Pandas "date_range" function and then inserting that column into a new dataframe.
import pandas as pd

# read in csv file
df = pd.read_csv('Path to file')

# define 'Date' column
date_col = pd.date_range(start='1/1/2001', periods=364, freq='D')

# add 'Date' column at the 0th index to be the first column
# (insert modifies df in place and returns None, so its result is not assigned)
df.insert(0, 'Date', date_col)

# rearrange rows to columns and index by date
long_df = df.set_index('Date').unstack().to_frame('Data').swaplevel().sort_index()

# write out to new csv file, labelling the two index levels
long_df.to_csv('Transposed_csv_file', index=True, index_label=['Date', 'Location'])
The output I am looking for is a transposed CSV file that looks like this:
Date Location Data
1/1/2001 Location No. 1 Measurement 1
1/1/2001 Location No. 2 Measurement 2
1/1/2001 Location No. 3 Measurement 3
Once January 1st is complete, it will move on to January 2nd, like so:
1/2/2001 Location No. 1 Measurement 1
1/2/2001 Location No. 2 Measurement 2
1/2/2001 Location No. 3 Measurement 3
This pattern will repeat all the way to the end at 12/31/2001.
Three columns -- many rows. Basically, I am transposing from an X position to a Y position formatted CSV file.
What's happening right now is that when I attempted to run these lines of code, I noticed via the task manager that my memory was slowly being consumed, reaching beyond 96%. I have 32GB of RAM. There is no way a 2GB CSV file being read in by Pandas and outputting another large transposed file should consume that much memory. I'm not sure what I am doing wrong or if there is a better method I can use to achieve the results I want. Thank you for your help.
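One way to keep memory bounded is to reshape the file in chunks rather than all at once; a minimal sketch, assuming the layout above with the day number in the file's first column (the chunk size and output name are placeholders):

import pandas as pd

date_col = pd.date_range(start='1/1/2001', periods=364, freq='D')

# stream the wide file a few day-rows at a time instead of loading all 2GB
first = True
for chunk in pd.read_csv('Path to file', index_col=0, chunksize=30):
    # map each day-row to its date (assumes the index runs 0..363)
    chunk.insert(0, 'Date', date_col[chunk.index])
    # wide -> long: one row per (Date, Location) pair
    long_chunk = (chunk.melt(id_vars='Date', var_name='Location', value_name='Data')
                       .sort_values(['Date', 'Location']))
    long_chunk.to_csv('Transposed_csv_file.csv', mode='w' if first else 'a',
                      header=first, index=False)
    first = False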
I'm a beginner at Python and I have a school project where I need to analyze an Excel document. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains a code that we use to identify some materials; a material code looks like this -> 3A8356. There are different material codes in the same column and they repeat a lot. Is there a way I can analyze the column and extract the repeated codes so I can make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it into something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials']). The default keep='first' retains one row per code; note that keep=False would instead discard every code that appears more than once, which is not what you want here.
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
the subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Per the docs, the new data frame with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to re-index the first column, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
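Putting it together, a minimal sketch (the file path is a placeholder, and header=11 assumes the 'Materials' header sits on sheet row 12 so the data starts at B13):

import pandas as pd

df = pd.read_excel('materials.xlsx', header=11)  # adjust path/header to your file

unique = (df.drop_duplicates(subset=['Materials'])  # one row per code
            .sort_values('Materials')               # the example output is sorted
            .reset_index(drop=True))
print(unique['Materials'])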
I have a csv file imported to a pandas dataframe. It probably came from a database export that combined a one-to-many parent and detail table. The format of the csv file is as follows:
header1, header2, header3, header4, header5, header6
sample1, property1,,,average1,average2
,,detail1,detail2,,
,,detail1,detail2,,
,,detail1,detail2,,
sample2, ...
,,detail1,detail2,,
,,detail1,detail2,,
...
(i.e. line 0 is the header, line 1 is record 1, lines 2 through n are details, line n+1 is record 2 and so on...)
What is the best way to extricate (renormalize?) the details into separate DataFrames that can be referenced using values in the sample# records? The number of detail rows in each subset differs for each sample.
I can use:
samplelist = df.header2[pd.notnull(df.header2)]
to get the starting index of each sample so that I can grab samplelist.index[0] to samplelist.index[1] and put it in a smaller dataframe. Detail records by themselves have no reference to which sample they came from, so that has to be inferred from the order of the csv file (notice that there is no intersection of filled/empty fields in my example).
Should I make a list of dataframes, a dict of dataframes, or a panel of dataframes?
Can I somehow create variables from the sample1 record fields and somehow attach them to each dataframe that has only detail records (like a collection of objects that have several scalar members and one dataframe each)?
Eventually I will create statistics on data from each detail record grouping and plot them against values in the sample records (e.g. sampletype, day or date, etc. vs. mystatistic). I will create intermediate Series to also be attached to the sample grouping like a kernel density estimation PDF or histogram.
Thanks.
You can use the fact that the first column is empty except on new sample records: .fillna(method='ffill') fills each sample value down over its detail rows, and .groupby('header1') then gives you the separate groups. On these, you can calculate statistics right away or store each group as a separate DataFrame. A high-level sketch follows:
df.header1 = df.header1.fillna(method='ffill')

for sample, data in df.groupby('header1'):
    print(sample)  # access to sample name
    data = ...     # process sample records
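For example, to get a quick per-sample summary of one of the detail columns (header3 is just the placeholder name from the question):

# summary statistics of a detail column, one row per sample
print(df.groupby('header1')['header3'].describe())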
The answer above got me going in the right direction. With further work, the following was used. It turns out I needed to use two columns as a compound key to uniquely identify samples.
df.header1 = df.header1.fillna(method='ffill')
df.header2 = df.header2.fillna(method='ffill')

grouped = df.groupby(['header1', 'header2'])
samplelist = []
dfParent = pd.DataFrame()
dfDetail = pd.DataFrame()

for sample, data in grouped:
    samplelist.append(sample)
    # the first row of each group is the parent record
    dfParent = dfParent.append(data.head(n=1), ignore_index=True)
    # the remaining rows are the detail records
    dfDetail = dfDetail.append(data[1:], ignore_index=True)

# axis=1 so that columns, not rows, are dropped
dfParent = dfParent.drop(['header3','header4',etc...], axis=1)  # columns only used in detail records
dfDetail = dfDetail.drop(['header5','header6',etc...], axis=1)  # columns only used once per sample

# Now details can be extracted by sample number in the sample list
# (e.g. the first 10 for sample 0)
samplenumber = 0
dfDetail[
    (dfDetail['header1'] == samplelist[samplenumber][0]) &
    (dfDetail['header2'] == samplelist[samplenumber][1])
].header3[:10]
Useful links were:
Pandas groupby and get_group
Pandas append to DataFrame