My DataFrame contains dates in mixed representations such as "21-10-2021" and 29052021, and I want to extract the format pattern of each one.
For example, '5-15-2019' needs to produce '%d-%m-%Y',
and '05152021' needs to produce '%d%m%Y'.
I tried it this way:
import re

search6 = []
for val in list(df.apply(lambda x: re.search(r'(?:[1-9]|[12][0-9]|3[01])[-](?:[1-9]|10|11|12)[-]\d{2,4}', str(x)))):
    if val:
        li = val.group()
        search6.append(li)
print(search6)
Output: I got a list of the matched date strings. But what I need is the pattern '%d-%m-%Y', and similarly the pattern for '%d%m%Y'. How do I do that? Can anybody help me? Thank you.
You can use the internal pandas method pandas._libs.tslibs.parsing.guess_datetime_format. Be careful, this is not part of the public API, so the function might change without any warning in the future.
option 1
from pandas._libs.tslibs.parsing import guess_datetime_format
s = pd.Series(['21-10-2021', '29052021', '5-15-2019', '05152021', '20000101', '01-01-2001'])
s.map(lambda x: guess_datetime_format(x, dayfirst=True))
option 2
Dates without separators and with the year last (such as DDMMYYYY) are not supported; option 1 returns None for them. For those you need to cheat by adding dashes temporarily:
def parse(x):
    out = guess_datetime_format(x, dayfirst=True)
    if out is None and x.isdigit() and len(x) == 8:
        out = (guess_datetime_format(f'{x[:2]}-{x[2:4]}-{x[4:]}',
                                     dayfirst=True)
               .replace('-', '')
               )
    return out
s.map(parse)
Example:
date option1 option2
0 21-10-2021 %d-%m-%Y %d-%m-%Y
1 29052021 None %d%m%Y
2 5-15-2019 %m-%d-%Y %m-%d-%Y
3 05152021 None %m%d%Y
4 20000101 %Y%m%d %Y%m%d
5 01-01-2001 %d-%m-%Y %d-%m-%Y
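If you'd rather not depend on a private pandas function, a small helper that tries a list of candidate formats with datetime.strptime covers the formats in this question (a sketch; the candidate list is an assumption you would extend for other layouts):

```python
from datetime import datetime

# Candidate formats, tried in order (an assumption: extend for other layouts).
CANDIDATES = ['%d-%m-%Y', '%m-%d-%Y', '%d%m%Y', '%m%d%Y', '%Y%m%d']

def guess_format(value):
    """Return the first candidate format that parses `value`, else None."""
    for fmt in CANDIDATES:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None
```

Like dayfirst=True above, the order of the candidates decides how ambiguous dates such as '01-01-2001' get labelled.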
I have a pandas df column containing the following strings:
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
I would like to extract the data from these strings and organize it as columns. As you can see, not all rows contain the same fields, and they are not in the same order. I only need some of the columns; this is the expected output:
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
I made some tests with str.extract but couldn't get what I want.
Any ideas on how to achieve it?
Thanks
You could try this using string methods. Assuming that the strings are stored in a column named 'main_col':
df["Type"] = df.main_col.str.split("(", expand=True)[0]
df["conId"] = df.main_col.str.partition("conId=")[2].str.partition(",")[0]
df["symbol"] = df.main_col.str.partition("symbol=")[2].str.partition(",")[0].str.strip("'")
df["localSymbol"] = df.main_col.str.partition("localSymbol=")[2].str.partition(",")[0].str.strip("'")
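Put together on the sample rows, the approach looks like this (a sketch; .str.strip("'") is added so the quoted values match the expected output):

```python
import pandas as pd

df = pd.DataFrame({'main_col': [
    "Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', "
    "multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')",
    "Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', "
    "localSymbol='IJPA', tradingClass='IJPA')",
]})

# the text before the first '(' is the type
df['Type'] = df.main_col.str.split('(', expand=True)[0]
# partition on the key, cut at the next comma, strip the surrounding quotes
df['conId'] = df.main_col.str.partition('conId=')[2].str.partition(',')[0]
df['symbol'] = df.main_col.str.partition('symbol=')[2].str.partition(',')[0].str.strip("'")
df['localSymbol'] = df.main_col.str.partition('localSymbol=')[2].str.partition(',')[0].str.strip("'")
```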
One solution using pandas.Series.str.extract (as you tried using it):
>>> df
col
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
>>> df.col.str.extract(r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+).*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'")
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
In the above, I assume that:
Type takes the two values Future or Stock
conId consists of digits
symbol consists of capital alphabet letters
localSymbol consists of digits and capital alphabet letters
You may want to adapt the pattern to better fit your needs.
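If you want the extracted fields alongside the original column, str.extract with named groups returns a DataFrame that you can join back (a sketch on shortened sample strings):

```python
import pandas as pd

df = pd.DataFrame({'col': [
    "Future(conId=462009617, symbol='CGB', localSymbol='CGBZ21')",
    "Stock(conId=80268543, symbol='IJPA', localSymbol='IJPA')",
]})

pattern = (r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+)"
           r".*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'")

# str.extract yields one column per named group; join keeps the source column
out = df.join(df.col.str.extract(pattern))
```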
I am trying to convert a df to all numeric values but getting the following error.
ValueError: Unable to parse string "15,181.80" at position 0
Here is my current code:
data = pd.read_csv('pub?gid=1704010735&single=true&output=csv',
usecols=[0,1,2],
header=0,
encoding="utf-8-sig",
index_col='Date')
data.apply(pd.to_numeric)
print("we have a total of:", len(data), " samples")
data.head()
And df before I am trying to convert:
Clicks Impressions
Date
01/03/2020 15,181.80 1.22%
02/03/2020 12,270.76 0.56%
03/03/2020 39,420.79 0.80%
04/03/2020 22,223.97 0.79%
05/03/2020 17,084.45 0.88%
I think the issue is that it can't handle special characters such as the "," thousands separator. Is this correct? What is the best way to convert the df to all numeric values?
Thanks!
Deleting all the , characters in the numbers of your DataFrame will fix the problem.
This is the code I used:
import pandas as pd
df = pd.DataFrame({'value':['10,000.23','20,000.30','10,000.10']})
df['value'] = df['value'].str.replace(',', '').astype(float)
df.apply(pd.to_numeric)
OUTPUT:
value
0 10000.23
1 20000.30
2 10000.10
EDIT:
You can also use:
df = df.value.str.replace(',', '').astype(float)
Here value is the column that you want to convert; note that this assigns the converted column back as a Series, not the whole DataFrame.
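Since the values come from read_csv in the question, another option is to let the parser strip the separators itself: read_csv takes a thousands argument. The percentage column still needs its own pass. A sketch with an inline CSV standing in for the original URL:

```python
import io
import pandas as pd

csv_data = """Date,Clicks,Impressions
01/03/2020,"15,181.80",1.22%
02/03/2020,"12,270.76",0.56%
"""

# thousands=',' makes read_csv parse "15,181.80" as the float 15181.80
df = pd.read_csv(io.StringIO(csv_data), thousands=',', index_col='Date')

# strip the '%' and convert the percentage to a fraction
df['Impressions'] = df['Impressions'].str.rstrip('%').astype(float) / 100
```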
I am currently trying to convert a CSV with Python 3 to a new format.
My later goal is to add some information to this file with pandas,
things like "is the date a weekday or a weekend day?".
To achieve this, however, I have to overcome the first hurdle.
I need to transform my CSV file from this:
date,hour,price
2018-10-01,0-1,59.53
2018-10-01,1-2,56.10
2018-10-01,2-3,51.41
2018-10-01,3-4,47.38
2018-10-01,4-5,47.59
2018-10-01,5-6,51.61
2018-10-01,6-7,69.13
2018-10-01,7-8,77.32
2018-10-01,8-9,84.97
2018-10-01,9-10,79.56
2018-10-01,10-11,73.70
2018-10-01,11-12,71.63
2018-10-01,12-13,63.15
2018-10-01,13-14,60.24
2018-10-01,14-15,56.18
2018-10-01,15-16,53.00
2018-10-01,16-17,53.37
2018-10-01,17-18,60.42
2018-10-01,18-19,69.93
2018-10-01,19-20,75.00
2018-10-01,20-21,65.83
2018-10-01,21-22,53.86
2018-10-01,22-23,46.46
2018-10-01,23-24,42.50
2018-10-02,0-1,45.10
2018-10-02,1-2,44.10
2018-10-02,2-3,44.06
2018-10-02,3-4,43.70
2018-10-02,4-5,44.29
2018-10-02,5-6,48.13
2018-10-02,6-7,57.70
2018-10-02,7-8,68.21
2018-10-02,8-9,70.36
2018-10-02,9-10,54.53
2018-10-02,10-11,48.49
2018-10-02,11-12,46.19
2018-10-02,12-13,44.15
2018-10-02,13-14,30.79
2018-10-02,14-15,27.75
2018-10-02,15-16,30.74
2018-10-02,16-17,26.77
2018-10-02,17-18,38.68
2018-10-02,18-19,48.52
2018-10-02,19-20,49.03
2018-10-02,20-21,45.43
2018-10-02,21-22,32.04
2018-10-02,22-23,26.22
2018-10-02,23-24,1.08
2018-10-03,0-1,2.13
2018-10-03,1-2,0.10
...
to this:
date,0-1,1-2,2-3,3-4,4-5,5-6,6-7,7-8,8-9,...,23-24
2018-10-01,59.53,56.10,51.41,47.38,47.59,51.61,69.13,77.32,84.97,...,42.50
2018-10-02,45.10,44.10,44.06,43.70,44.29,....
2018-10-03,2.13,0.10,....
...
I've tried a lot with pandas DataFrames, but I can't come up with a solution.
import numpy as np
import pandas as pd
df = pd.read_csv('file.csv')
df
date hour price
0 2018-10-01 0-1 59.53
1 2018-10-01 1-2 56.10
2 2018-10-01 2-3 51.41
3 2018-10-01 3-4 47.38
4 2018-10-01 4-5 47.59
5 2018-10-01 5-6 51.61
6 2018-10-01 6-7 69.13
7 2018-10-01 7-8 77.32
8 2018-10-01 8-9 84.97
The resulting DataFrame should have the wide layout shown above,
but I don't manage to fill it:
df = pd.DataFrame(df, index=['date'], columns=['date','0-1','1-2','2-3', '3-4', '4-5', '5-6', '6-7', '7-8', '8-9', '9-10', '10-11', '11-12', '12-13', '13-14', '14-15', '15-16', '16-17', '17-18', '18-19', '19-20', '20-21', '21-22', '22-23', '23-24'])
How would you solve this?
You can use pandas.DataFrame.unstack():
# pivot the dataframe with hour to the columns
df1 = df.set_index(['date','hour']).unstack(1)
# drop level-0 on columns
df1.columns = [ c[1] for c in df1.columns ]
# sort the column names by numeric order of hours (the number before '-')
df1 = df1.reindex(columns=sorted(df1.columns, key=lambda x: int(x.split('-')[0]))).reset_index()
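The same reshape can also be written with DataFrame.pivot, which states the intent directly (a sketch on a two-hour sample):

```python
import pandas as pd

df = pd.DataFrame({
    'date':  ['2018-10-01', '2018-10-01', '2018-10-02', '2018-10-02'],
    'hour':  ['0-1', '1-2', '0-1', '1-2'],
    'price': [59.53, 56.10, 45.10, 44.10],
})

# one row per date, one column per hour slot
wide = df.pivot(index='date', columns='hour', values='price')
# restore numeric hour order ('10-11' sorts before '1-2' lexically)
wide = wide.reindex(columns=sorted(wide.columns, key=lambda c: int(c.split('-')[0])))
wide = wide.reset_index()
```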
If I understand correctly, try using the index_col argument of pd.read_csv(), using integer labelling for the columns in the file:
df = pd.read_csv('file.csv', index_col=0)
See the read_csv docs; don't be put off by the alarming number of keyword arguments, one of them will often do what you need!
You may need to parse the first two columns as a date, then add a column for weekend based on a condition on the result. See the parse_dates and infer_datetime_format keyword arguments.
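For the later goal of flagging weekends, parsing the date column while reading makes it a one-liner (a sketch; the inline CSV stands in for the wide file):

```python
import io
import pandas as pd

csv_data = """date,0-1,1-2
2018-10-01,59.53,56.10
2018-10-06,45.10,44.10
"""

df = pd.read_csv(io.StringIO(csv_data), index_col='date', parse_dates=['date'])
# Monday == 0 ... Sunday == 6, so >= 5 means Saturday or Sunday
df['is_weekend'] = df.index.dayofweek >= 5
```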
I have data streaming in the following format:
from io import StringIO
data ="""\
ANI/IP
sip:5554447777#10.94.2.15
sip:10.66.7.34#6665554444
sip:3337775555#10.94.2.11
"""
import pandas as pd
df = pd.read_table(StringIO(data),sep='\s+',dtype='str')
What I would like to do is replace the column content with just the phone number part of the string above. I tried the suggestions from this thread like so:
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '').astype('str')
print(df)
However, this results in:
.....print(df)
ANI/IP
0 sip:#10.94.2.15
1 sip:#10.66.7.34
2 sip:#10.94.2.11
I need to keep the phone numbers instead. How do I achieve this?
ANI/IP
0 5554447777
1 6665554444
2 3337775555
The regex \d{10} matches a substring of exactly 10 digits.
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '').astype('str')
This removes the numbers!
Note: you shouldn't use astype('str') here (it's not needed, and there is no str dtype in pandas).
You want to extract these phone numbers:
In [11]: df["ANI/IP"].str.extract(r'(\d{10})') # before overwriting!
Out[11]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object
Set this as another column and you're away:
In [12]: df["phone_number"] = df["ANI/IP"].str.extract(r'(\d{10})')
You could use pandas.core.strings.StringMethods.extract to extract them:
In [10]: df['ANI/IP'].str.extract("(\d{10})")
Out[10]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object