I am trying to convert a df to all numeric values but am getting the following error:
ValueError: Unable to parse string "15,181.80" at position 0
Here is my current code:
data = pd.read_csv('pub?gid=1704010735&single=true&output=csv',
                   usecols=[0, 1, 2],
                   header=0,
                   encoding="utf-8-sig",
                   index_col='Date')
data.apply(pd.to_numeric)
print("we have a total of:", len(data), " samples")
data.head()
And the df before I try to convert it:
            Clicks     Impressions
Date
01/03/2020  15,181.80  1.22%
02/03/2020  12,270.76  0.56%
03/03/2020  39,420.79  0.80%
04/03/2020  22,223.97  0.79%
05/03/2020  17,084.45  0.88%
I think the issue is that it can't handle special characters such as "," (the thousands separator). Is this correct? What is the best way to convert the df to all numeric values?
Thanks!
Deleting all the commas in the numbers of your dataframe will fix your problem.
This is the code I used:
import pandas as pd

df = pd.DataFrame({'value': ['10,000.23', '20,000.30', '10,000.10']})
# strip the thousands separators, then cast to float
df['value'] = df['value'].str.replace(',', '').astype(float)
df = df.apply(pd.to_numeric)  # assign the result; apply does not modify in place
OUTPUT:
value
0 10000.23
1 20000.30
2 10000.10
EDIT:
You can also use:
df['value'] = df.value.str.replace(',', '').astype(float)
where value is the column that you want to convert.
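Alternatively, pandas can strip the separators while parsing the file: read_csv accepts a thousands parameter. A sketch reusing the question's call (note that a percentage column like Impressions would still need its "%" stripped separately before pd.to_numeric):

import pandas as pd

# thousands=',' makes read_csv parse "15,181.80" as the float 15181.80
data = pd.read_csv('pub?gid=1704010735&single=true&output=csv',
                   usecols=[0, 1, 2],
                   header=0,
                   encoding="utf-8-sig",
                   thousands=',',
                   index_col='Date')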
Related
I have two dataframes, one with earnings dates and a code for before market/after market, and the other with daily OHLC data.
First dataframe df:
     earnDate    anncTod
103  2015-11-18  0900
104  2016-02-24  0900
105  2016-05-18  0900
...  ...         ...
128  2022-03-01  0900
129  2022-05-18  0900
130  2022-08-17  0900
Second dataframe af:
Datetime    Open      High      Low       Close     Volume
2005-01-03  36.3458   36.6770   35.5522   35.6833   3343500
...         ...       ...       ...       ...       ...
2022-04-22  246.5500  247.2000  241.4300  241.9100  1817977
I want to take a date from the first dataframe and find the open and/or close price in the second dataframe. Depending on anncTod value, I want to find the close price of the previous day (if =0900) or the open and close price on the following day (else). I'll use these numbers to calculate the overnight, intraday and close-to-close move which will be stored in new columns on df.
I'm not sure how to search for matching values and then fetch a value from the same row but a different column. I'm trying to do this with df.iloc and a for loop.
Here's the full code:
import pandas as pd
import requests
import datetime as dt

ticker = 'TGT'

## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])

## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index)-28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[: , 1:-1]

## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime','Open','High','Low','Close','Volume'])

## create total return, overnight and intraday columns in df
df['Total Move'] = ''  ##col #2
df['Overnight'] = ''   ##col #3
df['Intraday'] = ''    ##col #4

for date in df['earnDate']:
    if df.iloc[date,1] == '0900':
        priorday = af.loc[af.index.get_loc(date)-1,0]
        priorclose = af.loc[priorday,4]
        open = af.loc[date,1]
        close = af.loc[date,4]
        df.iloc[date,2] = close/priorclose
        df.iloc[date,3] = open/priorclose
        df.iloc[date,4] = close/open
    else:
        print('afternoon')
I get an error:
if df.iloc[date,1] == '0900':
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
Converting the date columns to integers creates another error. Is there a better way I should go about doing this?
Ideal output would look like (made up numbers, abbreviated output):
earnDate    anncTod  Total Move  Overnight Move  Intraday Move
2015-11-18  0900     9%          7.2%            1.8%
But would include all the dates given in the first dataframe.
UPDATE
I swapped df.iloc for df.loc and that seems to have solved that problem. The new issue is searching for the variable 'date' in the second dataframe af. I have simplified the code to just print the value in the 'Open' column while I troubleshoot.
Here is the updated and simplified code (all else remains the same):
import pandas as pd
import requests
import datetime as dt

ticker = 'TGT'

## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])

## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index)-28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[: , 1:-1]

## set index to earnDate
df = df.set_index(pd.DatetimeIndex(df['earnDate']))

## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime','Open','High','Low','Close','Volume'])

## create total return, overnight and intraday columns in df
df['Total Move'] = ''  ##col #2
df['Overnight'] = ''   ##col #3
df['Intraday'] = ''    ##col #4

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(af.loc[date,'Open'])  ##this is line generating error
    else:
        print('afternoon')
I now get KeyError: '2015-11-18'.
Using loc to access a certain row assumes that the label you search for is in the index. Specifically, that means you'll need to set the date column as index. EX:
import pandas as pd

df = pd.DataFrame({'earnDate': ['2015-11-18', '2015-11-19', '2015-11-20'],
                   'anncTod': ['0900', '1000', '0800'],
                   'Open': [111, 222, 333]})

df = df.set_index(df["earnDate"])

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(df.loc[date, 'Open'])

# prints
# 111
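Applied to the question's script, af needs the same treatment before af.loc[date, 'Open'] can work. A sketch of the loop under that assumption, continuing from the question's df and af (it also assumes every earnings date actually appears in af; market holidays or missing dates would need extra handling, and dayopen/dayclose are used to avoid shadowing the builtin open):

import pandas as pd

# index af by its dates so positional lookups relative to a date are possible
af = af.set_index(pd.DatetimeIndex(af['Datetime'])).sort_index()

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        pos = af.index.get_loc(pd.Timestamp(date))  # row position of the earnings date
        priorclose = af['Close'].iloc[pos - 1]      # previous trading day's close
        dayopen = af['Open'].iloc[pos]
        dayclose = af['Close'].iloc[pos]
        df.loc[date, 'Total Move'] = dayclose / priorclose
        df.loc[date, 'Overnight'] = dayopen / priorclose
        df.loc[date, 'Intraday'] = dayclose / dayopen
    else:
        print('afternoon')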
I have a CSV that has several columns containing datetime data. After removing rows that are "Invalid", I want to find the Min and Max datetime value of each row and place the results in new columns. However, I get a FutureWarning when I add the code for these 2 new columns.
FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
Here is my script:
import pandas as pd
import os
import glob
# Datetime columns
dt_columns = ['DT1','DT2','DT3']
df = pd.read_csv('full_list.csv', dtype=object)
# Remove "invalid" Result rows
remove_invalid = df['Result'].str.contains('Invalid')
df.drop(index=df[remove_invalid].index, inplace=True)
for eachCol in dt_columns:
    df[eachCol] = pd.to_datetime(df[eachCol]).dt.strftime('%Y-%m-%d %H:%M:%S')
# Create new columns
df['MinDatetime'] = df[dt_columns].min(axis=1)
df['MaxDatetime'] = df[dt_columns].max(axis=1)
# Save as CSV
df.to_csv('test.csv', index=False)
For context, the CSV looks something like this (usually contains 100000+ rows):
Area Serial Route DT1 DT2 DT3 Result
Brazil 17763 4 13/08/2021 23:46:31 16/10/2021 14:04:27 28/10/2021 08:19:59 Confirmed
China 28345 2 15/09/2021 03:09:21 24/04/2021 09:56:34 04/05/2021 22:07:13 Confirmed
Malta 13630 5 21/03/2021 11:59:27 18/09/2021 11:03:25 02/07/2021 02:32:48 Invalid
Serbia 49478 2 12/04/2021 06:38:05 19/03/2021 03:16:47 29/06/2021 06:39:30 Confirmed
France 34732 1 29/04/2021 03:03:14 24/03/2021 01:49:48 26/04/2021 06:44:21 Invalid
Mexico 21840 3 23/11/2021 12:53:33 10/01/2022 02:42:48 29/04/2021 14:22:51 Invalid
Ukraine 20468 3 04/11/2021 18:40:44 13/11/2021 03:38:39 11/03/2021 09:09:14 Invalid
China 28830 1 07/02/2021 23:50:34 03/12/2021 14:04:32 14/07/2021 22:59:10 Confirmed
India 49641 4 02/06/2021 11:17:35 09/05/2021 13:51:55 19/01/2022 06:56:07 Confirmed
Greece 43163 3 30/11/2021 09:31:29 28/01/2021 08:52:50 12/05/2021 07:49:48 Invalid
I want the minimum and maximum value of each row in new columns, but at the moment the code produces the 2 columns with no data in them.
What am I doing wrong?
Better filtering without drop:
# Remove "invalid" Result rows
remove_invalid = df['Result'].str.contains('Invalid')
df = df[~remove_invalid]
You need to work with real datetimes, so after converting you cannot use Series.dt.strftime, because that turns the values back into string representations of datetimes:
for eachCol in dt_columns:
    df[eachCol] = pd.to_datetime(df[eachCol], dayfirst=True)
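Putting both fixes together, a sketch of the corrected script (same file and column names as in the question; strings are produced only at the very end, after min/max have compared real datetimes):

import pandas as pd

dt_columns = ['DT1', 'DT2', 'DT3']
df = pd.read_csv('full_list.csv', dtype=object)

# filter without drop
df = df[~df['Result'].str.contains('Invalid')]

# parse to real datetimes (day first, e.g. 13/08/2021 23:46:31)
for eachCol in dt_columns:
    df[eachCol] = pd.to_datetime(df[eachCol], dayfirst=True)

# row-wise min/max now compare datetimes, so no FutureWarning
df['MinDatetime'] = df[dt_columns].min(axis=1)
df['MaxDatetime'] = df[dt_columns].max(axis=1)

# format as strings only for the output file
for col in dt_columns + ['MinDatetime', 'MaxDatetime']:
    df[col] = df[col].dt.strftime('%Y-%m-%d %H:%M:%S')

df.to_csv('test.csv', index=False)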
I am new to data science and your help is appreciated. My question is about grouping a dataframe by its columns so that a bar chart can be plotted showing each subject's status.
My csv file is something like this:
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
expected o/p:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas. It's not exactly the same output format as in the question, but it definitely has the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', "English", "sports"], var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
PS: Going forward, always paste the code for what you've tried so far.
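Since the end goal is a bar chart, result_df from above can also be pivoted and plotted directly with pandas' built-in matplotlib plotting; a sketch:

import matplotlib.pyplot as plt

# one group of bars per subject, one bar per status
result_df.pivot(index='Subject', columns='Status', values='Count').plot.bar(rot=0)
plt.show()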
To match exactly your output, this is what you could do:
import pandas as pd

df = pd.read_csv('c:/temp/data.csv')  # Or wherever your csv file is
subjects = ['Maths', 'Science', 'English', 'sports']  # Or you could get that as df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Status', 'Count'])
print(new_df)
I must suggest though that I would avoid the for loop. My approach would be just these two lines:

subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = df.groupby(subjects)['Name'].count()

Depending on your application, you may already have the data you need in grouped_rows (here, one count per combination of statuses).
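For a genuinely loop-free count per subject, a sketch with value_counts (assuming the question's column names; note that read_csv parses the literal string "NA" as NaN, so fill it back first if you want it counted):

import pandas as pd

df = pd.read_csv('input.csv')
subjects = ['Maths', 'Science', 'English', 'sports']

# one value_counts per subject column: index = status, columns = subject
counts = df[subjects].fillna('NA').apply(pd.Series.value_counts)
print(counts)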
After reading from a large Excel file I have the following data:
Mode Fiscal Year/Period Amount
ABC 12.2001 10243.00
CAB 2.201 987.87
I need to convert the above data frame as below
Mode Fiscal Year/Period Amount
ABC 012.2001 10243.00
CAB 002.2010 987.87
I need help converting the Fiscal Year/Period column.
It is always easier for us and you will get better help if you provide your attempts at the solution (your code).
Try this,
import pandas as pd
Recreating your data
data = {'mode':['abc', 'cab'], 'Fiscal Year/Period':[12.2001, 2.201]}
And put it in a dataframe,
data = pd.DataFrame(data)
Convert the column to a str,
data['Fiscal Year/Period'] = data['Fiscal Year/Period'].astype(str)
And use zfill() to fill with zeros:
data['Fiscal Year/Period'].apply(lambda x: x.zfill(8))
yields,
0 012.2001
1 0002.201
Name: Fiscal Year/Period, dtype: object
IIUC, you can just zfill and ljust
s = df['Fiscal_Year/Period'].str.split('.', expand=True)
s[0] = s[0].str.zfill(3)       # left-pad the part before the dot to 3 digits
s[1] = s[1].str.ljust(4, '0')  # right-pad the part after the dot to 4 digits
df['Year'] = s.agg('.'.join, axis=1)
print(df)
Mode Fiscal_Year/Period Amount Year
0 ABC 12.2001 10243.00 012.2001
1 CAB 2.201 987.87 002.2010
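The key point is that plain zfill cannot produce 002.2010 from 2.201: the integer part needs left padding while the fractional part needs right padding, hence the split. A compact variant of the same idea (assuming the column already holds strings):

# pad each side of the decimal point separately
df['Year'] = df['Fiscal_Year/Period'].apply(
    lambda v: v.split('.')[0].zfill(3) + '.' + v.split('.')[1].ljust(4, '0'))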
I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used .str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, because they work on the same column About, and because the values are converted to lowercase first, the regex can be changed to replace anything that is not lowercase:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '')
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '')
print (df)
About
0 aasd
1 sdd aa
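One caveat: since pandas 2.0, str.replace matches literally by default, so a regex pattern like this needs an explicit regex=True:

df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)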
import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '')
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex with flags:
>>> import re
>>> regex_pat = re.compile(r'[^a-z]+$', flags=re.IGNORECASE)
>>> df.About.str.lower().str.replace(regex_pat, '')
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z]+ matches one or more characters not present in the list a-z
a-z: a single character in the range between a (index 97) and z (index 122), case sensitive (hence the IGNORECASE flag above)
+ quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line