Calling the last 20 years from a CSV file in Python

I'm trying to select the last 20 years from a CSV file that has 'Date' and 'Price' columns in Python.
df = df[(df['Date']>datetime.Date(1999,1,1)) & (df['Date']<datetime.Date(2019,1,1))]
I was expecting to see the data for the last 20 years, from 1999 to 2019, alone.

I think you want the last 20 years of data dynamically, relative to whatever the max date is:
import pandas as pd

df = pd.read_csv('./your_data.csv')
df['date_col'] = pd.to_datetime(df['date_col'])
df['year'] = df['date_col'].dt.year

# Keep rows whose year falls within 20 years of the most recent year
last_20_years = df[df['year'] + 20 >= df['year'].max()]
last_20_years
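As an aside on the original attempt: the datetime module has no Date class (the constructor is datetime.date), and a pandas datetime column compares most naturally against pd.Timestamp. A minimal sketch of the fixed-window version, with made-up data standing in for the CSV:

```python
import pandas as pd

# Hypothetical data standing in for the CSV's 'Date' and 'Price' columns.
df = pd.DataFrame({
    "Date": pd.to_datetime(["1995-06-01", "2000-03-15", "2010-07-20", "2020-01-05"]),
    "Price": [10.0, 12.5, 18.0, 25.0],
})

# Fixed 20-year window: pd.Timestamp compares element-wise against a datetime64 column.
mask = (df["Date"] > pd.Timestamp(1999, 1, 1)) & (df["Date"] < pd.Timestamp(2019, 1, 1))
window = df[mask]
print(window)
```

With pd.Timestamp on both ends the boolean mask works directly on the datetime column, so no helper 'year' column is needed.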


How to combine rows of pandas dataframe only for the first column?

most_wickets_2021 = pd.read_html("https://stats.espncricinfo.com/ci/engine/records/bowling/most_wickets_career.html?id=13781;type=tournament")[0]
most_wickets_2021
Link to where I got the data from
This code prints out a dataframe that looks like this: image of the dataframe
How do I make it so the player's team just shows up in the first column right next to their name, instead of showing up in every column as a new row?
Ex:
Player                              Mat
Shahnawaz Dahani (Multan Sultans)   11
I didn't type every column but I hope you understand what I mean.
Using a simple for loop and "looking ahead" to the row that has the team, here is one simple way you could accomplish that.
import pandas as pd

df = pd.read_html("https://stats.espncricinfo.com/ci/engine/records/bowling/most_wickets_career.html?id=13781;type=tournament")[0]

print('Player/Team ', '\t\t\t', 'Mat', '\t', 'Inns')
# Even rows hold the player's stats; the following odd row holds the team name
for x in range(len(df) - 1):
    if (x % 2) == 0:
        print(df['Player'][x], df['Player'][x + 1], '\t\t\t', df['Mat'][x], '\t', df['Inns'][x])
Basically, this snippet walks through the dataframe in pairs of rows: the first row of each pair holds the player's stats, and the subsequent row holds the team name.
Here is a sampling of the output from my terminal when this program was run.
#Una:~/Python_Programs/Cricket$ python3 Cricket.py
Player/Team Mat Inns
Shahnawaz Dahani (Multan Sultans) 11 11
Wahab Riaz (Peshawar Zalmi) 12 12
Shaheen Shah Afridi (Lahore Qalandars) 10 10
JP Faulkner (Lahore Qalandars) 6 6
Imran Tahir (Multan Sultans) 7 7
Hasan Ali (Islamabad United) 10 10
S Mahmood (Peshawar Zalmi) 5 5
Imran Khan (Multan Sultans) 7 7
Mohammad Wasim (Islamabad United) 11 11
I didn't do too much work on formatting, but that should be easily cleaned up. You can try this method out and/or wait to see if someone has a more elegant way to do this.
Hope that helps.
Regards.
import pandas as pd

# Player and team names alternate rows; reshape them into (player, team) pairs
data = pd.DataFrame(most_wickets_2021["Player"].values.reshape(-1, 2))
res = pd.DataFrame({
    "Player": data[0] + " " + data[1],
})

# The stat columns are only meaningful on every other row
data = pd.DataFrame(most_wickets_2021[most_wickets_2021.columns[1:]].values[::2])
data.columns = most_wickets_2021.columns[1:]
pd.concat([res, data], axis=1)

Rolling Year Based on Condition

Hello, I have the following code:
# Import libraries
import numpy as np
import pandas as pd
import datetime as dt

# Connect to Drive
from google.colab import drive
drive.mount('/content/drive')

# Read data
ruta = '/content/drive/MyDrive/example.csv'
df = pd.read_csv(ruta)
df.head(15)

d = pd.date_range(start="2015-01-01", end="2022-01-01", freq='MS')
dates = pd.DataFrame({"DATE": d})
df["DATE"] = pd.to_datetime(df["DATE"])
df_merge = pd.merge(dates, df, how='outer', on='DATE')
The data that I am using, you could download here: DATA
What I am trying to achieve is something known as Rolling Year.
First I create this metric, grouped for each category:
# ROLLING YEAR
##################################################################################################
# I want to make a Rolling Year for each category: how much each category sold
# from 12 months ago to the current month.
# RY_ACTUAL: one year has 12 months, so I pass 12 as the rolling window
f = lambda x: x.rolling(12).sum()
df_merge["RY_ACTUAL"] = df_merge.groupby(["CATEGORY"])['Sales'].apply(f)
# RY_24: a rolling window of 24 to compare the actual RY vs the last RY
f_1 = lambda x: x.rolling(24).sum()
df_merge["RY_24"] = df_merge.groupby(["CATEGORY"])['Sales'].apply(f_1)
# RY_LAST: subtract RY_ACTUAL from RY_24 to get the amount of RY-1
df_merge["RY_LAST"] = df_merge["RY_24"] - df_merge["RY_ACTUAL"]
##################################################################################################
df_merge.head(30)
And it works perfectly, because if you download the file and then filter, for example, for the "Blue" category, you see something like this:
That means, if you stop at the 2015-November row, the RY_ACTUAL column shows the sum of all the values from the 12 preceding records.
My next goal is to create a similar column using the rolling function, but with the following condition:
The column must sum the sales of ALL the categories, as long as
the Colour/Animal column is equal to Colour. For example, if I
stop at 2016-December, it should give me the sum of ALL the sales
of the colours from 2016-January to 2016-December.
This was my attempt:
df_merge.loc[(df_merge['Colour/Animal'] == 'Colour'),'Sales'].apply(f)
Could anyone help me code this example correctly?
Thanks in advance, community!
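Not from the thread, but one possible way to get that conditional rolling sum (column names taken from the question, data made up): aggregate the Colour rows to a single total per month, roll over that monthly series, and map the result back onto every row.

```python
import pandas as pd

# Hypothetical monthly data mimicking the question's layout:
# one row per (DATE, CATEGORY), plus a Colour/Animal type column.
dates = pd.date_range("2015-01-01", "2016-12-01", freq="MS")
df = pd.concat([
    pd.DataFrame({"DATE": dates, "CATEGORY": "Blue", "Colour/Animal": "Colour", "Sales": 10}),
    pd.DataFrame({"DATE": dates, "CATEGORY": "Dog", "Colour/Animal": "Animal", "Sales": 5}),
]).sort_values("DATE").reset_index(drop=True)

# Total Sales per month over ALL 'Colour' rows, then a 12-month rolling sum.
colour_by_month = (df[df["Colour/Animal"] == "Colour"]
                   .groupby("DATE")["Sales"].sum()
                   .rolling(12).sum())

# Map the per-date result back onto every row of the original frame.
df["RY_COLOUR"] = df["DATE"].map(colour_by_month)
```

The key difference from the per-category code above is that the grouping is by DATE rather than by CATEGORY, so all qualifying categories contribute to the same monthly total before the rolling window is applied.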

Import CSV file where last column has many separators [duplicate]

This question already has an answer here:
python pandas read_csv delimiter in column data
(1 answer)
Closed 2 years ago.
The dataset looks like this:
region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
As you can see, the last column sometimes has separators in it. This makes it hard to import the CSV file in pandas using read_csv(). The only way I can import the file is by adding the parameter error_bad_lines=False to the function. But this way I'm losing some of the data.
How can I import the CSV file without losing data?
I would read the file as a single column and parse it manually:
import pandas as pd

# Use a separator that never occurs in the data so each line becomes one field
df = pd.read_csv(filename, sep='\t')

# One named group per fixed column; 'status' captures the rest, commas included
pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region', 'state', 'latitude', 'longitude']])
pat = '^' + pat + ',(?P<status>.*)$'
df = df.iloc[:, 0].str.extract(pat)
Output:
    region state  latitude  longitude             status
0  florida    FL   27.8333    -81.717  open,for,activity
1  georgia    GA   32.9866   -83.6487               open
2   hawaii    HI   21.1098  -157.5311      illegal,stuff
3     iowa    IA   42.0046   -93.214     medical,limited
Have you tried the old-school technique with the split function? A major downside is that you'd end up losing data or bumping into errors if your data has a , in any of the first 4 fields/columns, but if not, you could use it.
with open(file, 'r') as f:
    data = f.read().split('\n')
for line in data:
    # maxsplit=4: the 4 standard columns split off, the 5th keeps its commas
    items = line.split(',', 4)
Each row items would look, for example, like this:
['hawaii', 'HI', '21.1098', '-157.5311', 'illegal,stuff']
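The same split-on-the-first-four-commas idea can also stay vectorized in pandas: read each line as a single column, then use Series.str.split with n=4 (column names taken from the sample data):

```python
import io
import pandas as pd

raw = """region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited"""

# Read every line as one column, using a separator absent from the data.
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=1, header=None)

# Split on the first 4 commas only; everything after stays in 'status'.
out = df[0].str.split(",", n=4, expand=True)
out.columns = ["region", "state", "latitude", "longitude", "status"]
print(out)
```

This is the vectorized counterpart of the loop above, and it avoids writing a regular expression by hand.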

Organizing dates and holidays in a dataframe

Scenario: I have one dataframe with different columns of data, and another dataframe with lists of dates.
Example of dataframe1:
iterationcount  datecolumn  list
iteration5                  1
iteration5                  2
iteration3                  2
iteration3                  2
iteration4                  33
iteration3                  4
iteration1                  5
iteration2                  3
iteration5                  2
iteration4                  22
Example of dataframe2:
iteration1 01.01.2018 26.01.2018 30.03.2018
iteration2 01.01.2018 30.03.2018 02.04.2018 25.12.2018 26.12.2018
iteration3
iteration4 01.01.2018 15.01.2018 19.02.2018
iteration5 01.01.2018 19.02.2018 30.03.2018 21.05.2018 02.07.2018 06.08.2018 03.09.2018 08.10.2018 12.11.2018
The second dataframe is a list of holidays for each of the iterations, and it will be used to fill the second column of the first dataframe.
Constraints: For each iteration of the first dataframe, the user will select a month and year; the script will then find the first date of that month. If that date is on the list of dates in dataframe2 for that iteration, then pick the next working date based on the program calendar.
Ex: The user selects January 2018, and the code returns 01/01/2018. For the first iteration, that date is a holiday, so pick the next workday, in this case 02/01/2018, and then write this date to all rows of dataframe1 corresponding to that iteration:
iterationcount  datecolumn  list
iteration5                  1
iteration5                  2
iteration3                  2
iteration3                  2
iteration4                  33
iteration3                  4
iteration1      02/01/2018  5
iteration2                  3
iteration5                  2
iteration4                  22
Then move to the next iteration (some iterations will have the same calendar dates).
Code: I have tried multiple approaches so far, but could not achieve the result. The closest I think I got was with:
import pandas as pd
import datetime
import os
import glob

## Get Adjustments
mypath3 = "//DGMS/Desktop/Uploader_v1.xlsm"
ApplyOnDates = pd.read_excel(open(mypath3, 'rb'), sheet_name='Holidays')

# Get content
mypath = "//DGMS/Desktop/Uploaded"
all_files = glob.glob(os.path.join(mypath, "*.xls*"))
contentdataframes = []
for f in all_files:
    df = pd.read_excel(f)
    df['Name'] = os.path.basename(f).split('.')[0].split('_')[0]
    df['ApplyOn'] = ''
    mask = df.columns.str.contains('Base|Last|Fixing|Cash')
    c2 = df.columns[~mask].tolist()
    df = df[c2]
    contentdataframes.append(df)
finalfinal = pd.concat(contentdataframes)
for row in finalfinal.itertuples():
    datedatedate = datetime.datetime(2018, 1, 1)
    # Look up this iteration's holiday row and test whether the date is in it
    if ApplyOnDates[ApplyOnDates.Index.str.contains(row.Name)].isin([datedatedate]).any(axis=None):
        datetouse = datedatedate + datetime.timedelta(days=1)
    else:
        datetouse = datedatedate
    finalfinal['ApplyOn'] = datetouse
Question: Basically, my main trouble here is being able to match the rows in both dataframes and search the date in the column of the holidays dataframe. Is there a proper way to do this?
Obs: I was able to achieve a similar result directly in VBA by using Excel functions (VLOOKUP, MATCH...); the problem is that doing it in Excel for this amount of data basically crashes the file every time.
So you want to basically merge the columns of dataframe2 onto dataframe1, right? Try merge:
newdf = pd.merge(dataframe1, dataframe2, on='iterationcount',
                 how='inner', indicator=False)
That should give you a new frame.
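Not from the thread, but combining that with the holiday lookup, one possible sketch (names and layout assumed from the question): keep the holidays per iteration, take the first of the selected month, and roll forward a day at a time past weekends and holidays.

```python
import pandas as pd

# Hypothetical stand-ins for the question's two dataframes.
df1 = pd.DataFrame({"iterationcount": ["iteration1", "iteration2"], "list": [5, 3]})
holidays = {
    "iteration1": pd.to_datetime(["2018-01-01", "2018-01-26", "2018-03-30"]),
    "iteration2": pd.to_datetime(["2018-01-01", "2018-03-30"]),
}

def first_workday(year, month, holiday_dates):
    """First day of the month, rolled forward past weekends and holidays."""
    day = pd.Timestamp(year, month, 1)
    while day.weekday() >= 5 or day in set(holiday_dates):
        day += pd.Timedelta(days=1)
    return day

# The user selects January 2018: fill datecolumn per iteration.
df1["datecolumn"] = df1["iterationcount"].map(
    lambda it: first_workday(2018, 1, holidays.get(it, []))
)
```

Since 01/01/2018 is a holiday for both iterations here, both rows end up with 02/01/2018, matching the worked example in the question.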

could not convert string to float: '7,751.30' [duplicate]

This question already has answers here:
pandas reading CSV data formatted with comma for thousands separator
(3 answers)
Closed 5 years ago.
I get the TWSE price from Taiwan Stock Exchange.
df = pd.read_csv(r'C:\Stock\TWSE.csv',encoding='Big5')
df.head()
日期 開盤指數 最高指數 最低指數 收盤指數
0 96/02/01 7,751.30 7,757.63 7,679.78 7,701.54
1 96/02/02 7,754.16 7,801.63 7,751.53 7,777.03
2 96/02/05 7,786.77 7,823.94 7,772.05 7,783.12
3 96/02/06 7,816.30 7,875.75 7,802.94 7,875.75
4 96/02/07 7,894.77 7,894.77 7,850.06 7,850.06
df.loc[0][2]
'7,757.63'
type(df.loc[0][2])
str
I want to convert the str type to float for the purpose of plotting.
But I cannot convert them. For example:
float(df.loc[0][2])
ValueError: could not convert string to float: '7,757.63'
pd.read_csv, much like almost every other pd.read_* function, has a thousands parameter you can set to ',' to make sure that you're importing those values as floats.
The following is an illustration:
import io
import pandas as pd
txt = '日期 開盤指數 最高指數 最低指數 收盤指數\n0 96/02/01 7,751.30 7,757.63 7,679.78 7,701.54\n1 96/02/02 7,754.16 7,801.63 7,751.53 7,777.03\n2 96/02/05 7,786.77 7,823.94 7,772.05 7,783.12\n3 96/02/06 7,816.30 7,875.75 7,802.94 7,875.75\n4 96/02/07 7,894.77 7,894.77 7,850.06 7,850.06'
with io.StringIO(txt) as f:
    df = pd.read_table(f, encoding='utf8', header=0, thousands=',', sep=r'\s+')
print(df)
Yields:
日期 開盤指數 最高指數 最低指數 收盤指數
0 96/02/01 7751.30 7757.63 7679.78 7701.54
1 96/02/02 7754.16 7801.63 7751.53 7777.03
2 96/02/05 7786.77 7823.94 7772.05 7783.12
3 96/02/06 7816.30 7875.75 7802.94 7875.75
4 96/02/07 7894.77 7894.77 7850.06 7850.06
I hope this proves helpful.
float(df.loc[0][2].replace(',', ''))  # drop the thousands separator, then convert
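The one-liner above fixes a single cell; to convert an already-loaded column in one go, the same replace can be vectorized with the .str accessor (sample values taken from the question):

```python
import pandas as pd

s = pd.Series(["7,751.30", "7,757.63", "7,679.78"])

# Strip thousands separators, then cast the whole column to float.
prices = s.str.replace(",", "", regex=False).astype(float)
print(prices.tolist())
```

This is useful when the dataframe was already read without the thousands=',' option and re-reading the CSV is inconvenient.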
