I'm trying to output two CSVs by splitting the data on a date. However, whenever I try to do this I get a syntax error when referencing the column heading.
ufp = pd.read_excel('pythoncsv/C20190723.xlsx')
ufp.head()
"Segmentation Account# Last Name First Name State Phone Alt Phone Language Therapy Last Order Date Next Order Date Customer Type"
Last Order Date is shown as datetime64.
ts = pd.to_datetime('2019-04-25')
So this is the gist of the code I want to work:
ufp.loc[ufp.'Last Order Date' >= ts, :]
but I get
File "<ipython-input-22-0e4a7bef835a>", line 1
ufp.loc[ufp.'Last Order Date' >= ts, :]
^
SyntaxError: invalid syntax
Is the way I'm phrasing the column wrong? I'm extremely new to python so it's possible I just don't get what I'm doing.
ufp.loc[ufp['Last Order Date'] >= ts, :]
This should work
Columns accessed as attributes of a DataFrame (the .column_name notation) are written without quotation marks, and this only works when the label is a valid Python identifier, so a label containing spaces cannot be accessed this way. For instance, if the column were named Last_Order_Date you could write ufp.Last_Order_Date, but there is no attribute form for "Last Order Date"; use ufp['Last Order Date'] instead.
Try what Ramesh suggested above.
Just for completeness: starting with version 0.25.0, you can also use query with column names that contain spaces by surrounding them in backticks:
ufp.query('`Last Order Date` >= @ts')
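A minimal sketch of the whole split, assuming a small made-up frame in place of the Excel file (the sample rows and the output file name in the last comment are hypothetical):

```python
import pandas as pd

# Hypothetical sample standing in for pd.read_excel('pythoncsv/C20190723.xlsx')
ufp = pd.DataFrame({
    "Account#": [1, 2, 3],
    "Last Order Date": pd.to_datetime(["2019-03-01", "2019-04-25", "2019-06-10"]),
})
ts = pd.to_datetime("2019-04-25")

# Bracket notation works for column names that contain spaces
on_or_after = ufp.loc[ufp["Last Order Date"] >= ts]
before = ufp.loc[ufp["Last Order Date"] < ts]

# Equivalent filter with query(): backticks quote the column name,
# and @ refers to a Python variable in the calling scope
same_rows = ufp.query("`Last Order Date` >= @ts")

# Each part could then be written out, e.g. on_or_after.to_csv("recent.csv", index=False)
```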
I'm going on two months in Python and am focusing hard on pandas right now. In my current position I use VBA on data frames, so I'm learning this to slowly replace it and further my career.
As of now I believe my true problem is a lack of understanding of one or more key concepts. Any help would be greatly appreciated.
That said, here is my problem:
Where could I go to learn more about how to do this kind of precise filtering? I'm very close, but there is one key aspect I need.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below removes the dashes "-" and keeps only the first 9 digits. However, I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have no dashes and drop the last three digits, for nine digits in total: 000000000.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, ranging from 000-000-000_#12 to 000-000-000_35 or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like this (this does not work):
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst).any():
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass
For more clarification my input DataFrame is this:
   ID                   Street #  Street Name
0  004-330-002-000      2272      Narnia
1  021-521-410-000_128  2311      Narnia
2  001-243-313-000      2235      Narnia
3  002-730-032-000      2149      Narnia
4  000-000-000_a        1234      Narnia
And I am looking to do this as the output:
   ID                   Street #  Street Name
0  004330002            2272      Narnia
1  021-521-410-000_128  2311      Narnia
2  001243313            2235      Narnia
3  002730032            2149      Narnia
4  000-000-000_a        1234      Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel sheet I am using. "ID" is my column heading. I will make this the index after the fact.
My data frame on this job is small: (rows, columns) is (2500, 125).
I do not get an error message, so I am guessing I need a loop of some kind. I've started testing for loops with this as well. No luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module - I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs; it's not as fast as the first method.
def transform_ID(ID_val):
    # Leave IDs in the skip list unchanged; clean all others
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val
dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
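A third option, offered here as a sketch rather than as part of the answers above, is Series.where, which keeps the original value where a condition is true and takes a replacement elsewhere. The sample rows come from the question; the contents of the skip list are an assumption for illustration.

```python
import pandas as pd

dfSS = pd.DataFrame({
    "ID": ["004-330-002-000", "021-521-410-000_128", "001-243-313-000", "000-000-000_a"],
})
lst = ["021-521-410-000_128", "000-000-000_a"]  # assumed IDs to leave untouched

skip = dfSS["ID"].isin(lst)  # True for rows that must not change
cleaned = dfSS["ID"].str.replace("-", "", regex=False).str[:9]

# Keep the original value where skip is True, otherwise use the cleaned value
dfSS["ID"] = dfSS["ID"].where(skip, cleaned)
```

This avoids the placeholder column and computes the mask only once.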
This is based on @xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to use .loc, but I can't figure out how to eliminate this warning by using .loc correctly. I'm still learning it. No, I will not just ignore it even though it works. This is a learning opportunity.
Second issue
I do not understand this part of the code. I know the left side is supposed to be rows and the right side columns. That said, why does this work? ID is a column, not a row, when this code is run. I make the ID :
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. As I said, it's XYZ's help that got me here, but I'm adding more .loc calls and reading the documentation until I can eliminate the warning.
# The full list has 1000+ manually entered IDs; these IDs get skipped.
uniqueID = ["032-234-987_#4256"]  # example entry
# Keep only the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
'Number of Vehicles Removed', 'County']]
#Place holder will make our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:,"ID "].str.replace("-", "").str[:9]
#the next code is the filter that goes through the list and skips them. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
#Makes the ID our index
df = df.set_index("ID ")
# Add today's date to the file name (requires the time module)
import time
todaysDate = time.strftime("%m-%d-%y")
#make it an excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
I will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs or sees this post.
Edit: SettingWithCopyWarning:
I fixed this chained-indexing problem by making a copy of the original DataFrame before filtering and making everything use .loc, as XYZ helped me with. Before you start to filter, call DataFrame.copy(), where DataFrame is the name of your own DataFrame.
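To make both points concrete, here is a small sketch with made-up rows (the frame named raw and its values are assumptions for illustration). It shows where .copy() goes and what the expression left of the comma evaluates to: a boolean Series with one entry per row, which is why a condition built from the "ID " column can act as a row selector.

```python
import pandas as pd

raw = pd.DataFrame({
    "ID ": ["004-330-002-000", "000-000-000_a"],
    "Street #": [2272, 1234],
    "County": ["X", "Y"],
})
uniqueID = ["000-000-000_a"]

# .copy() makes df an independent frame, so later assignments do not warn
df = raw[["ID ", "Street #"]].copy()

# Left of the comma: a boolean Series, True for every row whose ID is NOT in the list
row_mask = ~df["ID "].isin(uniqueID)

# Rows where the mask is True are selected; "ID " names the column to overwrite
df.loc[row_mask, "ID "] = df.loc[row_mask, "ID "].str.replace("-", "", regex=False).str[:9]
```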
I have a Pandas dataframe containing tweets. I want to count the number of tweets that have been retweeted.
This code does not work
tweets_retweeted = twitter.apply(lambda x:True if x.retweet_count > 0 else False)
count_of_tweets_retweeted = len(tweets_retweeted[tweets_retweeted == True].index)
The error message I get is
KeyError: ('retweet_count', 'occurred at index created_at')
Without having the ability to recreate your example, there are a few things that could be going on.
The error is likely coming from the 1st line where you are trying to access the column.
You may be passing one column at a time to the apply function rather than one row at a time. Please use axis = 1 to pass each row to see if it works.
Also, just a best practice (in my humble opinion) is to not reference column names with the dot notation. Try to use the bracket notation to differentiate between column names and methods.
Can you do:
j = twitter['retweet_count'] > 0
j.value_counts()
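A minimal sketch of both suggestions on a made-up frame (the sample values are assumptions): summing the boolean comparison gives the count directly, and apply with axis=1 reaches the same number by passing one row at a time.

```python
import pandas as pd

twitter = pd.DataFrame({
    "created_at": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "retweet_count": [0, 3, 1],
})

# Comparing the column gives a boolean Series; True counts as 1 when summed
count_of_tweets_retweeted = int((twitter["retweet_count"] > 0).sum())

# Row-wise apply gives the same answer but is slower on large frames
via_apply = int(twitter.apply(lambda row: row["retweet_count"] > 0, axis=1).sum())
```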
Broadly I have the Smart Meters dataset from Kaggle and I'm trying to get a count of the first and last measure by house, then trying to aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different than the line I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
SELECT House_ID, MAX(Date_Time) AS Max_DT
FROM ElectricGrid GROUP BY House_ID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However I'm failing to get the outer query. Specifically I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
In the rename example, my second query fails to find Date_Time or Max_Date_Time. In the latter case, with the ravel code, it appears not to find House_Id when I run it.
That seems weird; I would expect your code not to be able to find the House_Id field. After you perform your groupby on House_Id, it becomes the index, which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively you can just drop the multilevel column:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
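Another way to avoid the multilevel column entirely, sketched here on made-up data, is named aggregation (available from pandas 0.25), where the output column name is chosen up front. The name Max_DT is borrowed from the SQL in the question; the sample rows are assumptions.

```python
import pandas as pd

house_info = pd.DataFrame({
    "House_Id": ["A", "A", "B", "C"],
    "Date_Time": pd.to_datetime(["2014-01-01", "2014-02-27", "2014-02-27", "2014-02-28"]),
})

# Inner query: latest reading per house, with an explicit output column name
house_max = house_info.groupby("House_Id").agg(Max_DT=("Date_Time", "max"))

# Outer query: how many houses share each final reading date
start_end_collate = house_max.groupby("Max_DT").size()
```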
I have a set of columns that contain dates (imported from Excel file), and I need to process them as follows:
If a cell in one of those columns is blank, set another column to 1, else that column is 0. This allows me to sum all the 1's and show that those items are missing.
This is how I am doing that at present:
df_combined['CDR_Form_notfound'] = np.where(df_combined['CDR-Form'].mask(df_combined['CDR-Form'].str.len()==0).isnull(),1,0)
A problem I am having is that I have to format those columns so that A) dates are trimmed to show only the day/month/year and B) some of the columns have a value of "see notes" in them, instead of being a date or blank. That "see notes" is essential to properly accounting for missing items: it has to be there to keep the cell from flagging as empty and the item counting as missing (adding to the 'blank cells' count). The actual problem is that if I run this code before the .isnull code above, every blank becomes a NaN, a nan, or a NaT, and then nothing flags as null/missing.
This is the code I am using to trim the date strings and change the "see notes" to a string...because otherwise it just ends up blank in the output.
for c in df_combined[dateColumns]:
    df_combined[c] = df_combined[c].astype(str) # uncomment this if columns change from dtype=str
    df_combined[c] = np.where(df_combined[c].str.contains("20"), df_combined[c].str[:10], df_combined[c])
    df_combined[c] = np.where(df_combined[c].str.contains("see notes"), df_combined[c].str, df_combined[c])
I think my problem might have something to do with the dtypes of the columns. When I run print(df.dtypes), every column shows as 'object', except for one I specifically set to int using this:
df_combined['Num'] = df_combined['Num'].apply(lambda x: int(x) if x == x else "")
Are you trying to count NaNs?
If yes, you can just do:
df.isnull().sum()
I see that you mention "blank" because it is coming from Excel, so what you could do is transform these blanks into NaN before running the command above using:
df['CDR-Form'] = df['CDR-Form'].replace('', np.nan)
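One way to sidestep the ordering problem described in the question, sketched under the assumption that blanks arrive either as empty strings or as real missing values, is to compute the flag before any string conversion and only then trim the dates. The sample values are made up.

```python
import numpy as np
import pandas as pd

df_combined = pd.DataFrame({
    "CDR-Form": ["2019-07-23 00:00:00", "", "see notes", None],
})

col = df_combined["CDR-Form"]

# Treat empty strings and real missing values as blank; "see notes" stays non-blank
is_blank = col.isna() | (col.astype(str).str.strip() == "")
df_combined["CDR_Form_notfound"] = is_blank.astype(int)

# Only now trim date-like strings to their first 10 characters (YYYY-MM-DD)
looks_like_date = col.astype(str).str.contains("20", regex=False, na=False)
df_combined["CDR-Form"] = np.where(looks_like_date, col.astype(str).str[:10], col)
```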
When I look at the values in a column in my dataframe, I can see that due to user data entry errors, the same category has been entered incorrectly.
For my dataframe I use this code:
df['column_name'].value_counts()
output:
Targeted 523534
targeted 1
story 25425
story 2
multiple 2524543
For story, I guess there is a space?
I am trying to replace targeted with Targeted.
df['column_name'].replace("targeted","Targeted")
But nothing is happening, I still get the same value count.
Yes, it seems there is leading or trailing whitespace.
Need str.strip first and then Series.replace or Series.str.replace:
df['column_name'] = df['column_name'].str.strip().replace("targeted","Targeted")
df['column_name'] = df['column_name'].str.strip().str.replace("targeted","Targeted")
Another possible solution is to convert all characters to lowercase:
df['column_name'] = df['column_name'].str.strip().str.lower()
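A small sketch on made-up values shows why the original call appeared to do nothing: replace returns a new Series, so the result has to be assigned back to the column.

```python
import pandas as pd

df = pd.DataFrame({"column_name": ["Targeted", "targeted", "story", "story ", "multiple"]})

# replace() returns a new Series; without assignment the frame is unchanged
df["column_name"].replace("targeted", "Targeted")
unchanged_counts = df["column_name"].value_counts()

# Strip whitespace, fix the case variant, and assign the result back
df["column_name"] = df["column_name"].str.strip().replace("targeted", "Targeted")
counts = df["column_name"].value_counts()
```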