So this is an assignment from my Python class. We have a DataFrame that contains information about a machine: for every minute the machine stands still (as in: no product is being produced, for whatever reason), a new row containing the reason is added to the DataFrame.
This is what the first standstill looks like in the DataFrame (the text looks odd because I had to translate it from German):
We see that Reason 1 for the standstill started on 2020-03-02 at 14:04 and was rectified at 14:07.
The assignment now is to create a new DataFrame that consolidates this information, so that it would look something like this:
I had the idea to use .shift() to check for the beginning and the end of each new standstill reason. The column "SAP?" ("Same as previous?") checks whether the Standstillreason in the current row is the same as in the previous one. The same goes for "SAN?" ("Same as next?") regarding the next row.
df["SAP?"] = df.Standstillreason.eq(df.Standstillreason.shift())
df["SAN?"] = df.Standstillreason.eq(df.Standstillreason.shift(periods=-1))
Every time SAP? is False, the row contains the start of a new standstill reason.
If SAN? is False, the row contains the end of the current reason.
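For illustration, here is how the two flags come out on a toy series of reasons (a quick sketch with made-up data):

import pandas as pd

s = pd.Series(["Reason 1", "Reason 1", "Reason 2"], name="Standstillreason")
print(s.eq(s.shift()))            # SAP?: False, True, False
print(s.eq(s.shift(periods=-1)))  # SAN?: True, False, False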
What I'm trying to do now is extract the needed information and put it into the consolidated DataFrame. I thought of something like this:
for index, row in df.iterrows():
    if row["SAP?"] == False:
        df_consolidated["Standstillreason"] = row["Standstillreason"]
But so far this hasn't worked. I'm not even sure whether I have the right approach to the problem.
If the reasons are all different, then simply
df.groupby("Standstillreason").agg({"Date":["first","last"], "Time":["first", "last"]})
should work.
If not, it is a bit more complicated. You want to create two DataFrames: one with the rows where SAP? is False (the start of each reason) and one with the rows where SAN? is False (the end of each reason), take the start date and time from the first and the end date and time from the second, then concat them together; a sketch follows below.
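A minimal sketch of that idea, reusing the SAP?/SAN? columns from the question (the Date and Time column names are taken from the groupby snippet above):

import pandas as pd

# start rows are where SAP? is False; end rows are where SAN? is False
starts = (df.loc[~df["SAP?"], ["Standstillreason", "Date", "Time"]]
            .rename(columns={"Date": "Start_Date", "Time": "Start_Time"})
            .reset_index(drop=True))
ends = (df.loc[~df["SAN?"], ["Date", "Time"]]
          .rename(columns={"Date": "End_Date", "Time": "End_Time"})
          .reset_index(drop=True))

# the rows align because every run has exactly one start and one end
df_consolidated = pd.concat([starts, ends], axis=1)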
I want to clear the contents of the first two cells in location for every first two duplicates in last name.
For example: I want to clear out the first two location occurrences for Balchuinas and only keep the third one. The same goes for London and Fleck. I ONLY want to clear out the location cells, not the complete rows.
Any help?
I tried the .drop_duplicates(keep='last') method, but that removes the whole row. I only want to clear the contents of the cells (or change them to NaN, if that's possible).
P.S. This is my first time asking a question, so I'm not sure how to paste the image without a link. Please help!
Rather than removing the duplicate rows, I would suggest finding the duplicate values and replacing them with NaN while keeping the last cell value.
Something like this:
# assuming the columns are named last_name and location; clear only the
# location cell for all but the last occurrence of each last name
df.loc[df.duplicated(subset='last_name', keep='last'), 'location'] = float('nan')
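A quick demonstration with made-up data (the column names are assumptions based on the question):

import pandas as pd

df = pd.DataFrame({
    "last_name": ["Balchuinas", "Balchuinas", "Balchuinas"],
    "location": ["Vilnius", "Kaunas", "London"],
})
df.loc[df.duplicated(subset="last_name", keep="last"), "location"] = float("nan")
print(df)  # the first two location cells are now NaN; all three rows remain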
I've got multiple Excel files and I need a specific value, but in each file the cell with the value changes position slightly. However, this value is always preceded by a generic description of it, which remains constant across all the files.
I was wondering if there was a way to ask Python to grab the value to the right of the element containing the string "xxx".
Try iterating over the Excel files (I guess you loaded each as a separate pandas object?),
something like for df in [dataframe1, dataframe2, ..., dataframeN]:.
Then you could pick the column you need (if the column stays constant), e.g. df['columnX'], and find which index the description has:
df.index[df['columnX'] == "xxx"]. It may make sense to add .tolist() at the end, so that if "xxx" is a value that repeats more than once, you get all occurrences in a list.
The last step would be to take the index + 1 to get the value you want; a short sketch follows below.
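A minimal sketch of those steps, assuming a default RangeIndex and placeholder file and column names:

import pandas as pd

for path in ["vendor1.xlsx", "vendor2.xlsx"]:  # your actual files here
    df = pd.read_excel(path)
    hits = df.index[df["columnX"] == "xxx"].tolist()  # every match of the label
    for i in hits:
        print(path, df["columnX"].loc[i + 1])  # the entry one row after the label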
Hope this was helpful.
In general, I would highly suggest being more specific in your questions and providing code/examples.
Been working on this project all day and it's destroying me. I've finished the web scraping and have a final .csv which contains the elements of a pandas DataFrame. I'm working with this DataFrame in a new file, and currently have the following:
df = pd.read_csv('active_homes.csv')
for i in range(len(df)):
    add = df['Address'][i]
    price = df['Price'][i]
    if (price < 100000) == True:
        print(price)
'active_homes.csv' looks like this:
Address,Status,Price,Meta
"387 8th St, Burlington, CO 80807",For Sale,169500,"4bed2bath1,560sqft"
and the resulting df's shape is (1764, 4).
This should, in theory, print the price for every row where price < 100000.
In practice, it prints this:
I have confirmed that at each iteration of the above for loop it collects the correct 'Price' and 'Address' information, and I have also confirmed that at each step the logic (price<100000) evaluates correctly. However, it still produces the output above. I was originally trying to just drop the rows of the DataFrame that were <100000, but that did nothing. I also tried reassigning the data to a new DataFrame, which would either return an empty DataFrame or a DataFrame with duplicated data for this one house (with a 'Price' of 58900).
So far, from all of that, I believe the program recognizes the correct number of houses < 100000, but for some reason the assignment sticks to the one address. It also does the same thing without assignment, as in:
for i in range(len(df)):
    if (df['Price'][i] < 100000) == True:
        print(df['Price'][i])
Any help in identifying the error would be much appreciated.
With pandas you should try to never iterate everything in the traditional Python way. Instead, you can achieve the desired result using boolean indexing:
df = pd.read_csv('active_homes.csv')
temp_df = df[df["Price"] < 100000]  # initiating a new df isn't required, just force of habit
print(temp_df["Price"])  # displays the series of houses below 100K; prettier print, imo
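If the goal is to keep the filtered rows in df itself rather than in a separate frame, assigning the result back achieves that; a one-line sketch:

df = df[df["Price"] < 100000].reset_index(drop=True)  # keep only houses below 100K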
I'm learning how to use pandas, and I've downloaded some data from Kaggle about car prices etc.
I'm trying to create a new dataframe by subsetting all the cars out that have the model "Golf".
golfs = df[df.model == "Golf"]
It does return a new DataFrame, but when I call it, it's just empty besides the column names.
Trying this:
others = df[df.model != "Golf"]
creates a new DataFrame, but it has everything in it. The datatype of the column is object. So I tried to create subsets by transmission, which is also an object.
man_trans = df[df.transmission == "Manual"]
creates a new DataFrame with solely Manual transmissions... I have no idea where it's going wrong. I've tried subsetting all the other columns, but it's just the first one that won't behave. I've even tried copying and pasting the cell value directly into the code.
I've even tried adding in:
df.reset_index()
to create a new index, as I thought that might be the problem.
The code looks correct to me. If the golfs DataFrame is empty, it is possible you don't have any rows where df['model'] == 'Golf'. Maybe it's == "golf" instead?
# If this doesn't work...
# golfs = df[df.model == "Golf"]
# maybe try this (or something like it):
golfs = df[df.model == "golf"]
So David was totally correct about the space, but it wasn't at the end, it was at the beginning. I checked what David was saying by applying:
len(df.model.iloc[798])
to find out how many characters were in the cell I was looking at. "Golf" only has 4 characters, but 5 were being returned.
df.model = df.model.str.lstrip()
len(df.model.iloc[798])  # now returns 4
Thanks to the people who responded and helped me find the answer to this stupidly simple problem.
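A more defensive variant, in case other cells have trailing spaces as well, is to strip both sides right after loading:

df.model = df.model.str.strip()  # removes leading and trailing whitespace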
Each day I receive many different files from different vendors, and their sizes vary greatly. I am looking for some dynamic code that will decide what is relevant across all files. I would like to think through how to break each file into components (df1, df2, df3, for example), which will make analysis easier.
Basically the first 6 lines are for overall information about the store (df1).
The 2nd component is reserved for specific item sales (starting on row 9 and ending on a DIFFERENT row in every file), and I'm not sure how to capture that. I have tried something along the lines of
numb = df.loc['Type of payment'].index[0] - 2
but it brings in a tuple instead of the row location (an int). How can I save upperrange and lowerrange as dynamic ints so that each day the correct df2 data is brought in?
The same problem exists at the bottom under "Type of payment": you will notice that crypto is included on the first day but not the second. I need to find a way to get a dynamic range that removes erroneous info and keeps the integrity of the rest. I think finding lowerrange will allow me to capture from that point to the end of the sheet, but I'm open to suggestions.
df = pd.read_csv('GMSALES.csv', skipfooter=2)
upperrange = df.loc['Item Number']  # brings in a tuple
lowerrange = df.loc['Type of payment']  # brings in a tuple
df1 = df.iloc[:, 7]  # this works
df2 = df.iloc[upperrange:lowerrange]  # this is what I would like to get to
df3 = df.iloc[lowerrange:]  # this is what I would like to get to
Your organizational problem is that your data comes in as a spreadsheet that is used for physical organization more than functional organization. The "columns" are merely typographical tabs. The file contains several types of heterogeneous data; you are right in wanting to reorganize this into individual data frames.
Very simply, you need to parse the file, customer by customer -- either before or after reading it into RAM.
From your current organization, this involves simply scanning the "df2" range of your heterogeneous data frame. I think that the simplest way is to start from row 7 and look for "Item Number" in column A; that is your row of column names. Then scan until you find a row with nothing in column A; back up one row, and that gives you lowerrange.
Repeat with the payments: find the next row with "Type of payment". I will assume that you have some way to discriminate payment types from fake data, such as a list of legal payment types (strings). Scan from "Type of Payment" until you find a row with something other than a legal payment type; the previous row is your lowerrange for df3.
Can you take it from there?
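A rough sketch of that scan, under a few assumptions: a header-less read so row positions survive, the marker strings "Item Number" and "Type of payment" from the question, and a placeholder list of legal payment types you would replace with your own:

import pandas as pd

raw = pd.read_csv("GMSALES.csv", header=None, skipfooter=2, engine="python")
col_a = raw[0].fillna("").astype(str)  # column A as plain strings

df1 = raw.iloc[:6]  # the first six rows: overall store information

# df2: from the "Item Number" header row down to the first blank cell in column A
upper = col_a[col_a == "Item Number"].index[0]
lower = col_a[col_a == ""].loc[upper:].index[0]  # assumes a blank row ends the block
df2 = raw.iloc[upper:lower]

# df3: from "Type of payment" down, keeping only legal payment-type rows
legal = {"Type of payment", "Cash", "Credit", "Crypto"}  # placeholder list
pay = col_a[col_a == "Type of payment"].index[0]
df3 = raw.iloc[pay:][col_a.loc[pay:].isin(legal)]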