How to convert a data frame column in Python?

After reading from large excel I have the following data
Mode  Fiscal Year/Period  Amount
ABC   12.2001             10243.00
CAB   2.201               987.87
I need to convert the above data frame as below
Mode  Fiscal Year/Period  Amount
ABC   012.2001            10243.00
CAB   002.2010            987.87
I need help converting the Fiscal Year/Period column.

It is always easier for us, and you will get better help, if you provide your attempt at a solution (your code).
Try this,
import pandas as pd
Recreating your data
data = {'mode':['abc', 'cab'], 'Fiscal Year/Period':[12.2001, 2.201]}
And put it in a dataframe,
data=pd.DataFrame(data)
Convert the column to a str,
data['Fiscal Year/Period']=data['Fiscal Year/Period'].astype(str)
And use zfill() to fill with zeros
data['Fiscal Year/Period'].apply(lambda x: x.zfill(8))
yields,
0 012.2001
1 0002.201
Name: Fiscal Year/Period, dtype: object

IIUC, you can just zfill and ljust
# assumes the column already holds strings; if it is numeric, cast it first
# with df['Fiscal_Year/Period'] = df['Fiscal_Year/Period'].astype(str)
s = df['Fiscal_Year/Period'].str.split('.', expand=True)
s[0] = s[0].str.zfill(3)
s[1] = s[1].str.ljust(4, '0')
df['Year'] = s.agg('.'.join, axis=1)
print(df)
  Mode Fiscal_Year/Period    Amount      Year
0  ABC            12.2001  10243.00  012.2001
1  CAB              2.201    987.87  002.2010
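The same padding can also be done in one pass with a small helper. This is only a sketch; the helper name pad_fiscal is made up, and it assumes the same "zero-fill the left part to 3, right-pad the right part to 4" rule as above:
def pad_fiscal(value):
    # split on the dot, zero-fill the period part, right-pad the year part
    left, _, right = str(value).partition('.')
    return left.zfill(3) + '.' + right.ljust(4, '0')

df['Fiscal Year/Period'] = df['Fiscal Year/Period'].apply(pad_fiscal)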

Related

DataFrame contains a column of dates of these types: '5-15-2019' and 05152021; I want to extract the format pattern

My DataFrame contains dates of these types: "21-10-2021" and 29052021, and I want to extract the format pattern of each.
For example, '5-15-2019' needs to produce '%d-%m-%Y',
and '05152021' needs to produce '%d%m%Y'.
I tried it this way:
import re

search6 = []
for val in list(df.apply(lambda x: re.search(r'(?:[1-9]|[12][0-9]|3[01])[-](?:[1-9]|10|11|12)[-]\d{2,4}', str(x)))):
    if val:
        li = val.group()
        search6.append(li)
print(search6)
Output: I got a list of the matched date strings, but what I need is the pattern '%d-%m-%Y', and similarly the pattern '%d%m%Y'. How do I do that? Can anybody help me? Thank you.
You can use the internal pandas method pandas._libs.tslibs.parsing.guess_datetime_format. Be careful, this is not part of the public API, so the function might change without any warning in the future.
option 1
from pandas._libs.tslibs.parsing import guess_datetime_format
import pandas as pd

s = pd.Series(['21-10-2021', '29052021', '5-15-2019', '05152021', '20000101', '01-01-2001'])
s.map(lambda x: guess_datetime_format(x, dayfirst=True))
option 2
Dates without separators that end in the year (DDMMYYYY-style) are not supported. For those you need to cheat by adding dashes temporarily:
def parse(x):
    out = guess_datetime_format(x, dayfirst=True)
    if out is None and x.isdigit() and len(x) == 8:
        out = (guess_datetime_format(f'{x[:2]}-{x[2:4]}-{x[4:]}',
                                     dayfirst=True)
               .replace('-', ''))
    return out

s.map(parse)
Example:
         date   option1   option2
0  21-10-2021  %d-%m-%Y  %d-%m-%Y
1    29052021      None    %d%m%Y
2   5-15-2019  %m-%d-%Y  %m-%d-%Y
3    05152021      None    %m%d%Y
4    20000101    %Y%m%d    %Y%m%d
5  01-01-2001  %d-%m-%Y  %d-%m-%Y
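As a possible follow-up (a sketch, not part of the answer above): once a format has been inferred, it can be passed straight to pd.to_datetime to parse the value. This reuses s and parse from option 2:
import pandas as pd

formats = s.map(parse)
parsed = pd.Series([pd.to_datetime(value, format=fmt) if fmt else pd.NaT
                    for value, fmt in zip(s, formats)])
print(parsed)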

Extract columns from string

I have a pandas df column containing the following strings:
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
I would like to extract data from strings and organize it as columns. As you can see, not all rows contain the same data and they are not in the same order. I only need some of the columns; this is the expected output:
     Type      conId symbol localSymbol
0  Future  462009617    CGB      CGBZ21
1   Stock   80268543   IJPA        IJPA
2   Stock  153454120   EMIM        EMIM
I made some tests with str.extract but couldn't get what I want.
Any ideas on how to achieve it?
Thanks
You could try this using string methods. Assuming that the strings are stored in a column named 'main_col':
df["Type"] = df.main_col.str.split("(", expand = True)[0]
df["conId"] = df.main_col.str.partition("conId=")[2].str.partition(",")[0]
df["symbol"] = df.main_col.str.partition("symbol=")[2].str.partition(",")[0]
df["localSymbol"] = df.main_col.str.partition("localSymbol=")[2].str.partition(",")[0]
One solution using pandas.Series.str.extract (as you tried using it):
>>> df
col
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
>>> df.col.str.extract(r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+).*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'")
     Type      conId symbol localSymbol
0  Future  462009617    CGB      CGBZ21
1   Stock   80268543   IJPA        IJPA
2   Stock  153454120   EMIM        EMIM
In the above, I assume that:
Type takes the two values Future or Stock
conId consists of digits
symbol consists of capital alphabet letters
localSymbol consists of digits and capital alphabet letters
You may want to adapt the pattern to better fit your needs.
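If you want to keep the original column and attach the extracted fields to the frame, a minimal sketch (assuming the column is named col as above):
extracted = df.col.str.extract(r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+).*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'")
df = df.join(extracted)  # the named groups become the new columns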

ValueError: Unable to parse string "15,181.80" at position 0

I am trying to convert a df to all numeric values but getting the following error.
ValueError: Unable to parse string "15,181.80" at position 0
Here is my current code:
data = pd.read_csv('pub?gid=1704010735&single=true&output=csv',
usecols=[0,1,2],
header=0,
encoding="utf-8-sig",
index_col='Date')
data.apply(pd.to_numeric)
print("we have a total of:", len(data), " samples")
data.head()
And df before I am trying to convert:
               Clicks Impressions
Date
01/03/2020  15,181.80       1.22%
02/03/2020  12,270.76       0.56%
03/03/2020  39,420.79       0.80%
04/03/2020  22,223.97       0.79%
05/03/2020  17,084.45       0.88%
I think the issue is that it cannot handle the special characters, e.g. "," - is this correct? What is the best way to convert the df to all numeric values?
Thanks!
Deleting all the "," characters in the numbers of your dataframe will fix your problem.
This is the code I used:
import pandas as pd
df = pd.DataFrame({'value':['10,000.23','20,000.30','10,000.10']})
df['value'] = df['value'].str.replace(',', '').astype(float)
df.apply(pd.to_numeric)
OUTPUT:
value
0 10000.23
1 20000.30
2 10000.10
EDIT:
You can also use:
df = df.value.str.replace(',', '').astype(float)
where value is the name of the column that you want to convert.
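For the frame in the question there is also the percentage column, so a sketch cleaning both columns before converting (column names assumed from the question) could look like this:
import pandas as pd

df = pd.DataFrame({'Clicks': ['15,181.80', '12,270.76'],
                   'Impressions': ['1.22%', '0.56%']})
# drop the thousands separator, strip the '%' sign and rescale to a fraction
df['Clicks'] = df['Clicks'].str.replace(',', '', regex=False).astype(float)
df['Impressions'] = df['Impressions'].str.rstrip('%').astype(float) / 100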

Is it possible to transform a CSV file with date as index?

I am currently trying to convert a CSV with Python 3 to a new format.
My later goal is to add some information to this file with pandas,
things like "is the date a weekday or a weekend day?".
To achieve this, however, I first have to overcome this hurdle.
I need to transform my CSV file from this:
date,hour,price
2018-10-01,0-1,59.53
2018-10-01,1-2,56.10
2018-10-01,2-3,51.41
2018-10-01,3-4,47.38
2018-10-01,4-5,47.59
2018-10-01,5-6,51.61
2018-10-01,6-7,69.13
2018-10-01,7-8,77.32
2018-10-01,8-9,84.97
2018-10-01,9-10,79.56
2018-10-01,10-11,73.70
2018-10-01,11-12,71.63
2018-10-01,12-13,63.15
2018-10-01,13-14,60.24
2018-10-01,14-15,56.18
2018-10-01,15-16,53.00
2018-10-01,16-17,53.37
2018-10-01,17-18,60.42
2018-10-01,18-19,69.93
2018-10-01,19-20,75.00
2018-10-01,20-21,65.83
2018-10-01,21-22,53.86
2018-10-01,22-23,46.46
2018-10-01,23-24,42.50
2018-10-02,0-1,45.10
2018-10-02,1-2,44.10
2018-10-02,2-3,44.06
2018-10-02,3-4,43.70
2018-10-02,4-5,44.29
2018-10-02,5-6,48.13
2018-10-02,6-7,57.70
2018-10-02,7-8,68.21
2018-10-02,8-9,70.36
2018-10-02,9-10,54.53
2018-10-02,10-11,48.49
2018-10-02,11-12,46.19
2018-10-02,12-13,44.15
2018-10-02,13-14,30.79
2018-10-02,14-15,27.75
2018-10-02,15-16,30.74
2018-10-02,16-17,26.77
2018-10-02,17-18,38.68
2018-10-02,18-19,48.52
2018-10-02,19-20,49.03
2018-10-02,20-21,45.43
2018-10-02,21-22,32.04
2018-10-02,22-23,26.22
2018-10-02,23-24,1.08
2018-10-03,0-1,2.13
2018-10-03,1-2,0.10
...
to this:
date,0-1,1-2,2-3,3-4,4-5,5-6,6-7,7-8,8-9,...,23-24
2018-10-01,59.53,56.10,51.41,47.38,47.59,51.61,69.13,77.32,84.97,...,42.50
2018-10-02,45.10,44.10,44.06,43.70,44.29,....
2018-10-03,2.13,0.10,....
...
I've tried a lot with pandas DataFrames, but I can't come up with a solution.
import numpy as np
import pandas as pd
df = pd.read_csv('file.csv')
df
         date hour  price
0  2018-10-01  0-1  59.53
1  2018-10-01  1-2  56.10
2  2018-10-01  2-3  51.41
3  2018-10-01  3-4  47.38
4  2018-10-01  4-5  47.59
5  2018-10-01  5-6  51.61
6  2018-10-01  6-7  69.13
7  2018-10-01  7-8  77.32
8  2018-10-01  8-9  84.97
The resulting DataFrame should look like the target format above, but I don't manage to fill it. This was my attempt:
df = pd.DataFrame(df, index=['date'], columns=['date','0-1','1-2','2-3', '3-4', '4-5', '5-6', '6-7', '7-8', '8-9', '9-10', '10-11', '11-12', '12-13', '13-14', '14-15', '15-16', '16-17', '17-18', '18-19', '19-20', '20-21', '21-22', '22-23', '23-24'])
How would you solve this?
You can use pandas.DataFrame.unstack():
# pivot the dataframe with hour to the columns
df1 = df.set_index(['date','hour']).unstack(1)
# drop level-0 on columns
df1.columns = [ c[1] for c in df1.columns ]
# sort the column names by numeric order of hours (the number before '-')
df1 = df1.reindex(columns=sorted(df1.columns, key=lambda x: int(x.split('-')[0]))).reset_index()
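An equivalent sketch with pandas.DataFrame.pivot, assuming the same 'date', 'hour' and 'price' column names:
df1 = df.pivot(index='date', columns='hour', values='price')
# same numeric re-ordering of the hour columns as above
df1 = df1.reindex(columns=sorted(df1.columns, key=lambda x: int(x.split('-')[0]))).reset_index()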
If I understand correctly, try using the index_col argument of pd.read_csv(), using integer labelling for the columns in the file:
df = pd.read_csv('file.csv', index_col=0)
See the read_csv docs; don't be put off by the alarming number of keyword arguments, one of them will often do what you need!
You may need to parse the first two columns as a date, then add a column for weekend based on a condition on the result. See the parse_dates and infer_datetime_format keyword arguments.
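To illustrate the weekday/weekend step mentioned in the question, a minimal sketch assuming the date column is parsed while reading (as suggested above):
import pandas as pd

df = pd.read_csv('file.csv', index_col=0, parse_dates=True)
df['is_weekend'] = df.index.dayofweek >= 5  # 5 = Saturday, 6 = Sunday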

Replacing pandas column with a subset of itself through regex

I have data streaming in the following format:
from StringIO import StringIO
data ="""\
ANI/IP
sip:5554447777#10.94.2.15
sip:10.66.7.34#6665554444
sip:3337775555#10.94.2.11
"""
import pandas as pd
df = pd.read_table(StringIO(data),sep='\s+',dtype='str')
What I would like to do is replace the column content with just the phone number part of the string above. I tried the suggestions from this thread like so:
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '').astype('str')
print(df)
However, this results in:
ANI/IP
0 sip:#10.94.2.15
1 sip:#10.66.7.34
2 sip:#10.94.2.11
I need the phone numbers, so how do I achieve this:
ANI/IP
0 5554447777
1 6665554444
2 3337775555
The regex \d{10} searches for a substring of exactly 10 digits.
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '').astype('str')
This removes the numbers!
Note: You shouldn't do astype str (it's not needed and there is no str dtype in pandas).
You want to extract these phone numbers:
In [11]: df["ANI/IP"].str.extract(r'(\d{10})') # before overwriting!
Out[11]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object
Set this as another column and you're away:
In [12]: df["phone_number"] = df["ANI/IP"].str.extract(r'(\d{10})')
You could use pandas.core.strings.StringMethods.extract to extract the numbers:
In [10]: df['ANI/IP'].str.extract(r"(\d{10})")
Out[10]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object
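A regex-free alternative sketch, assuming every value contains exactly one '#' and the phone number is the all-digit side of it:
df['phone_number'] = (df['ANI/IP']
                      .str.replace('sip:', '', regex=False)
                      .str.split('#')
                      .apply(lambda parts: next((p for p in parts if p.isdigit()), None)))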
