Loop through column until blank - python

I am trying to loop through a pandas DataFrame until the columns are blank or do not contain the term 'Stock'. If a column contains a date, I want the word 'Check' to be printed.
I am using:
print(df)
  Stock  15/12/2015  15/11/2015  15/10/2015
0    AA          10          11          11
1    BB          20          10           8
2    CC          30          33          26
3    DD          40          80          60
I have tried the below (which is wrong):
column = df
while column != ("") or 'Stock':
print ('Check'),
column += 1
print ("")

There are a few problems in your code. First of all, the indentation is broken, so it's not even valid code.
Second, your comparison doesn't mean what you probably expect. column != ("") or 'Stock' will always be true: it first compares column with (""), and if they differ the expression is True; otherwise it evaluates 'Stock' and makes that the value of the expression (and a non-empty string counts as true in a boolean context). What you probably should have written instead is column != "" and column != "Stock" or possibly column not in ("", "Stock").
Third, I'm not sure you're looping the right way or using column the right way either. Is it correct to step to the next column with column += 1? I don't know pandas, but it seems odd. Comparing column to a string may also be incorrect.
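To see why the original condition never becomes false, here is a minimal sketch in plain Python. The values in candidates are made up for illustration; they only stand in for whatever column might hold.

```python
# Hypothetical stand-ins for the value of `column`.
candidates = ["", "Stock", "15/12/2015"]

results = []
for column in candidates:
    # `or` returns the truthy string 'Stock' whenever the first comparison is False.
    broken = column != "" or "Stock"
    # Compares `column` against both values, as probably intended.
    fixed = column not in ("", "Stock")
    results.append((column, bool(broken), fixed))
    print(repr(column), bool(broken), fixed)
```

The broken condition is truthy for every candidate, while the corrected one is only True for the value that is neither blank nor 'Stock'.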

Your code really needs improvement. You should follow @skyking's advice. I'd also add that you may want to transpose your DataFrame and store the dates as a variable.
Anyway, let me rephrase what you are looking for, to make sure I got it right: you want to iterate over the columns of your df, and for every column whose name is a date you print('Check'); otherwise nothing happens. Please let us know if this is wrong.
To achieve that, here is a possible approach. You can iterate over the column names and attempt to convert each string to a date, for instance using pd.to_datetime. If the conversion succeeds, a message is printed.
for name in df.columns:
    print(name)  # comment this line after testing
    try:
        pd.to_datetime(name)
    except ValueError:
        pass  # or do something in case the column name is not a date
    else:
        print('Check')
This outputs
Stock
15/12/2015
Check
15/11/2015
Check
15/10/2015
Check
You can see that Check was only printed when the column name was, at least, coercible into a date.
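A self-contained variant of the same idea, as a rough sketch: the sample frame, the helper name is_date_label, and the explicit day/month/year format are all assumptions made for illustration, not part of the original answer. Passing a format makes the day-first interpretation explicit instead of relying on inference.

```python
import pandas as pd

# Sample frame mirroring the layout in the question (values are illustrative).
df = pd.DataFrame(
    {
        "Stock": ["AA", "BB"],
        "15/12/2015": [10, 20],
        "15/11/2015": [11, 10],
    }
)


def is_date_label(name, fmt="%d/%m/%Y"):
    """Hypothetical helper: True if the column name parses as a date in `fmt`."""
    try:
        pd.to_datetime(name, format=fmt)
    except ValueError:
        return False
    return True


# Collect the column names that look like dates.
date_columns = [name for name in df.columns if is_date_label(name)]
print(date_columns)
```

Collecting the names in a list rather than printing inside the loop makes it easy to select those columns afterwards with df[date_columns].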

Related

How to modify multiple values in one column, but skip others in pandas python

Going on two months in python and I am focusing hard on Pandas right now. In my current position I use VBA on data frames, so learning this to slowly replace it and further my career.
As of now I believe my true problem is a lack of understanding of one or more key concepts. Any help would be greatly appreciated.
That said, here is my problem:
Where could I go to learn more about how to do things like this for more precise filtering? I'm very close, but there is one key aspect I need.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes "-" and only reads up to 9 digits. Yet I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the IDs with no dashes "-", as 000000000, and with three fewer 0's, totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, ranging from 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like (This does not work)
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst ).any()
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
pass
For more clarification my input DataFrame is this:
                    ID  Street #  Street Name
0      004-330-002-000      2272       Narnia
1  021-521-410-000_128      2311       Narnia
2      001-243-313-000      2235       Narnia
3      002-730-032-000      2149       Narnia
4        000-000-000_a      1234       Narnia
And I am looking to do this as the output:
                    ID  Street #  Street Name
0            004330002      2272       Narnia
1  021-521-410-000_128      2311       Narnia
2         001243313000      2235       Narnia
3         002730032000      2149       Narnia
4        000-000-000_a      1234       Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel file I am using. "ID" is my column heading; I will make it the index after the fact.
My DataFrame on this job is small, with (rows, columns) of (2500, 125).
I do not get an error message, so I am guessing I need a loop of some kind. I have started testing for loops as well. No luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # leave the listed IDs unchanged instead of returning None
dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
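A third option, sketched here under assumptions, avoids the placeholder column by using Series.where, which keeps the original value wherever the condition is True and takes the replacement elsewhere. The sample IDs and the contents of lst below are illustrative only.

```python
import pandas as pd

# Illustrative data in the shape described in the question.
dfSS = pd.DataFrame(
    {
        "ID": [
            "004-330-002-000",
            "021-521-410-000_128",
            "001-243-313-000",
            "000-000-000_a",
        ]
    }
)
lst = ["021-521-410-000_128", "000-000-000_a"]  # IDs to leave untouched

# True for the rows that must be skipped.
skip = dfSS["ID"].isin(lst)

# Cleaned version of every ID: dashes removed, first nine characters kept.
cleaned = dfSS["ID"].str.replace("-", "", regex=False).str[:9]

# Keep the original where `skip` is True, otherwise use the cleaned value.
dfSS["ID"] = dfSS["ID"].where(skip, cleaned)
print(dfSS)
```

Like the first method, this stays vectorized, so it should scale better than calling a Python function once per row.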
This is based on @xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to find out how to eliminate this warning by using .loc correctly. I'm still learning it. No, I will not just ignore it even though it works; this is a learning opportunity.
Second issue
I do not understand this part of the code. I know the left side is supposed to be rows and the right side is columns. That said, why does this work? ID is a column, not a row, when this code is run. I make the ID :
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. As I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
# example entry only; the real list has 1000+ manually entered IDs, all of which get skipped
uniqueID = ["032-234-987_#4256"]
# keep only the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]
# "Place Holder" is a new column holding the cleaned IDs
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]
# the filter: rows whose ID is in uniqueID are skipped. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
# make the ID our index
df = df.set_index("ID ")
# add the date to our file name (requires `import time`)
todaysDate = time.strftime("%m-%d-%y")
# write an Excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
I will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs or sees this post.
Edit: SettingWithCopyWarning:
I fixed this chained-indexing problem by making a copy of the original DataFrame before filtering and by using .loc everywhere, as XYZ helped me with. Before you start to filter, call DataFrame.copy(), where DataFrame is the name of your own DataFrame.
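For anyone landing here, a small sketch of both points, using made-up data and a trimmed column list (the frame named original and its values are assumptions for illustration). It shows the copy step and what sits on the left of the comma: a boolean Series with one True/False per row, which is why an expression built from the ID column can select rows.

```python
import pandas as pd

# Illustrative stand-in for the spreadsheet.
original = pd.DataFrame(
    {
        "ID ": ["004-330-002-000", "032-234-987_#4256"],
        "Street #": [2272, 2311],
        "County": ["X", "Y"],
    }
)
uniqueID = ["032-234-987_#4256"]

# .copy() gives an independent frame, so later assignments do not target a slice.
df = original[["ID ", "Street #"]].copy()

# One boolean per row: True where the ID is NOT in the skip list.
rows_to_clean = ~df["ID "].isin(uniqueID)
print(rows_to_clean.tolist())

# Left of the comma: which rows. Right of the comma: which column.
df.loc[rows_to_clean, "ID "] = (
    df.loc[rows_to_clean, "ID "].str.replace("-", "", regex=False).str[:9]
)
print(df)
```

Whether the warning appears at all depends on the pandas version, so treat the copy as a defensive habit rather than a guaranteed requirement.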

Printing and counting unique values from an .xlsx file

I'm fairly new to Python and still learning the ropes, so I need help with a step-by-step program that does not use any functions. I understand how to count through an unknown column range and output the quantity. However, for this program, I'm trying to loop through a column, picking out the unique numbers and counting how often each appears.
I have an Excel file with random numbers down column A. I only put in 20 numbers, but let's pretend the range is unknown. How would I go about extracting the unique numbers and writing them into a separate column along with how many times they appeared in the list?
I'm not really sure how to go about this. :/
unique = 1
while xw.Range((unique,1)).value != None:
frequency = 0
if unique != unique: break
quantity += 1
"end"
I presume that since you can't use functions this may be homework, so here is a high-level outline:
First, go through the column and put all the values in a list.
Second, take the first value from the list and go through the rest of the list: is it in there? If so, it is not unique. Remove the duplicate from the list and keep going; if you find another, remove that too.
Then take the second value, and so on.
You would just need a list comprehension, some loops and perhaps .pop().
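As a rough sketch of the counting step, here is one way to tally the values once they are in a list. Note that this uses a dictionary tally rather than the remove-the-duplicates approach outlined above, and the numbers in values are hypothetical stand-ins for column A.

```python
# Hypothetical values standing in for column A of the spreadsheet.
values = [4, 7, 4, 1, 7, 4]

counts = {}  # maps each unique value to the number of times it has been seen
for value in values:
    if value in counts:
        counts[value] += 1
    else:
        counts[value] = 1

# Each key is a unique number; each value is its frequency.
for value in counts:
    print(value, counts[value])
```

The keys of counts could then be written to one spreadsheet column and the frequencies to the column next to it.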
Using the pandas library would be the easiest way to do this. I created a sample Excel sheet with only one column, called "Random_num".
import pandas
data = pandas.read_excel("sample.xlsx", sheet_name = "Sheet1")
print(data.head()) # This would give you a sneak peek of your data
print(data['Random_num'].value_counts()) # This would solve the problem you asked for
# Make sure to pass your column name within the quotation marks
#eg: data['your_column'].value_counts()
Thanks

Python; count the number of tweets retweeted

I have a Pandas dataframe containing tweets. I want to count the number of tweets that have been retweeted.
This code does not work
tweets_retweeted = twitter.apply(lambda x:True if x.retweet_count > 0 else False)
count_of_tweets_retweeted = len(tweets_retweeted[tweets_retweeted == True].index)
The error message I get is
KeyError: ('retweet_count', 'occurred at index created_at')
Without having the ability to recreate your example, there are a few things that could be going on.
The error is likely coming from the first line, where you are trying to access the column.
By default, DataFrame.apply passes one column at a time to the function rather than one row at a time. Try axis=1 to pass each row and see if it works.
Also, as a best practice (in my humble opinion), avoid referencing column names with dot notation. Use bracket notation to differentiate between column names and methods.
Can you do:
j = twitter['retweet_count'] > 0
j.value_counts()
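Both suggestions can be compared side by side in a minimal sketch. The twitter frame below is invented for illustration; only the retweet_count and created_at column names come from the question.

```python
import pandas as pd

# Illustrative data; the real frame would hold actual tweets.
twitter = pd.DataFrame(
    {
        "created_at": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
        "retweet_count": [0, 3, 1, 0],
    }
)

# Row-wise apply: axis=1 hands each row to the lambda, so the column lookup works.
retweeted_by_row = twitter.apply(lambda row: row["retweet_count"] > 0, axis=1)
count_with_apply = int(retweeted_by_row.sum())

# Vectorized comparison: True counts as 1 when summed.
count_vectorized = int((twitter["retweet_count"] > 0).sum())

print(count_with_apply, count_vectorized)
```

The vectorized form is usually the simpler choice, since it needs no lambda and avoids the axis question entirely.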

Given an index label, how would you extract the index position in a dataframe?

New to python, trying to take a csv and get the country that has the max number of gold medals. I can get the country name as a type Index but need a string value for the submission.
csv has rows of countries as the indices, and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0,skiprows=1)
for loop to get the most gold medals:
mostMedals= iterator
getIndex = (df[df['medals' == mostMedals]).index #check the column medals
#for mostMedals cell to see what country won that many
ind = dataframe.index.get_loc[getIndex] #doesn't like the key
What I'm going for is to get the index position of the getIndex so I can run something like dataframe.index[getIndex] and that will give me the string I need but I can't figure out how to get that index position integer.
Expanding on my comments above, this is how I would approach it. There may be better/other ways, pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
df = read_csv('csv', index_col=0,skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression: the where method returns a frame with the same shape as df in which the rows that do not match the condition are filled with NaN. dropna() then removes the rows that contain NaN values, and index returns the row labels that remain. Finally, I wrap it all in list, which isn't strictly necessary, but I prefer working with simple built-in types unless I have a greater need.
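An alternative sketch, assuming the country names are the index as in the question: boolean indexing on the index gives the labels directly, and idxmax plus Index.get_loc gives a single label and its integer position. The medal counts and country letters here are made up.

```python
import pandas as pd

# Illustrative frame: countries as the index, one stats column.
df = pd.DataFrame({"medals": [10, 25, 25, 7]}, index=["A", "B", "C", "D"])

max_medals = df["medals"].max()

# All index labels whose medal count equals the maximum (handles ties).
countries = df.index[df["medals"] == max_medals].tolist()

# Label of the first row holding the maximum, as a plain string.
top_country = df["medals"].idxmax()

# Integer position of that label within the index.
position = df.index.get_loc(top_country)

print(countries, top_country, position)
```

Unlike the where/dropna route, this does not drop rows that merely contain NaN in some other column.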

Add 1 where a substring is present in a column

I have several strings concatenated in the rows of a column, separated by '|'.
I need to make a column for each of the strings, so I applied a unique method and now have an array with the desired strings; let's call it A.
I made columns that check whether each string in A is in the concatenated row of column C with this:
for i in A:
    df[i] = df['C'].str.contains(i)
This returns booleans, and now I'm trying to turn the booleans into 1 and 0 values. The goal is to make columns that tell whether each string in A is in the concatenated strings of C.
So, is there a way to make it return 1 for True and 0 for False? I'm also asking because I couldn't test much: A has 20 strings and C has 20 million rows, so I have to let my laptop run this overnight :P
I hope my English is clear, thanks!
Assuming I have understood your question well you are actually just one function away:
for i in A:
    df[i] = df['C'].str.contains(i).astype(int)
However, if you already have computed the Boolean values you can:
df[A] = df[A].astype(int)
or
df[A]=df[A].replace({True:1, False:0})
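A different technique that may be worth trying, sketched here with invented data, is Series.str.get_dummies, which splits each row on the separator and builds the 0/1 columns in one call. Note that it matches whole tokens between separators, whereas str.contains matches substrings, so the results can differ if one string is contained in another.

```python
import pandas as pd

# Illustrative frame with '|'-separated strings in column C.
df = pd.DataFrame({"C": ["red|blue", "green", "blue|green|red"]})

# One 0/1 column per distinct token found in C.
dummies = df["C"].str.get_dummies(sep="|")

# Attach the indicator columns to the original frame.
df = df.join(dummies)
print(df)
```

This also removes the need to compute the array A first, since the distinct strings become the column names.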
