How to import data from CSV file containing certain words? - python

I have a CSV file containing daily data on yields of different government bonds of varying maturities. The headers are formatted as by the country followed by the maturity of the bond, for eg UK 10Y. What I would like to do is just import all the yields for one government bond at all maturities for one date, so for example import all the UK government bond yields at a particular date. The first date is 07/01/2021.
I know I can use Pandas, but all the codes I have seen require to use usecols function when importing. I'd like to just create a function and import only the data that I want without using usecols.
Snapshot of data, UK data is further right, but format is the same

You can try:
import time
import datetime
col_to_check = "UK government bond yields"
get_after = "07/01/2021"
get_after = time.mktime(datetime.datetime.strptime(get_after, "%d/%m/%Y").timetuple())
with open("yourfile.csv", "r") as msg:
data = msg.readlines()
index_to_check = data[0].split(",").index(col_to_check)
for i, v in enumerate(data):
if i == 0:
pass
else:
date = time.mktime(datetime.datetime.strptime(v.split(",")[index_to_check], "%d/%m/%Y").timetuple())
if date > get_after:
pass
else:
data[i] = ""
print ([x for x in data if x])
This is untested code as you did not provide a sample input but in principle it should work.
You have the header name of the column you want to check, the limit date.
You get the index of the first in your csv row. You convert the limit date to timestamp integer.
Then you read your data line by line and check. If the date/timestamp is greater than your limit you pass, else you assign empty value at the corresponding index of data.
Finally you remove empty elements to get the final list.

Related

Python - Write a row into an array into a text file

I have to work on a flat file (size > 500 Mo) and I need to create to split file on one criterion.
My original file as this structure (simplified):
JournalCode|JournalLib|EcritureNum|EcritureDate|CompteNum|
I need to create to file depending on the first digit from 'CompteNum'.
I have started my code as well
import sys
import pandas as pd
import numpy as np
import datetime
C_FILE_SEP = "|"
def main(fic):
pd.options.display.float_format = '{:,.2f}'.format
FileFec = pd.read_csv(fic, C_FILE_SEP, encoding= 'unicode_escape')
It seems ok, my concern is to create my 2 files based on criteria. I have tried with unsuccess.
TargetFec = 'Target_'+fic+datetime.datetime.now().strftime("%Y%m%d-%H%M%S")+'.txt'
target = open(TargetFec, 'w')
FileFec = FileFec.astype(convert_dict)
for row in FileFec.iterrows():
Fec_Cpt = str(FileFec['CompteNum'])
nb = len(Fec_Cpt)
if (nb > 7):
target.write(str(row))
target.close()
the result of my target file is not like I expected:
(0, JournalCode OUVERT
JournalLib JOURNAL D'OUVERTURE
EcritureNum XXXXXXXXXX
EcritureDate 20190101
CompteNum 101300
CompteLib CAPITAL SOUSCRIT
CompAuxNum
CompAuxLib
PieceRef XXXXXXXXXX
PieceDate 20190101
EcritureLib A NOUVEAU
Debit 000000000000,00
Credit 000038188458,00
EcritureLet NaN
DateLet NaN
ValidDate 20190101
Montantdevise
Idevise
CodeEtbt 100
Unnamed: 19 NaN
And I expected to obtain line into my target file when CompteNum(0:1) > 7
I have read many posts for 2 days, please some help will be perfect.
There is a sample of my data available here
Philippe
Suiting the rules and the desired format, you can use logic like:
# criteria:
verify = df['CompteNum'].apply(lambda number: str(number)[0] == '8' or str(number)[0] == '9')
# saving the dataframes:
df[verify].to_csv('c:/users/jack/desktop/meets-criterios.csv', sep = '|', index = False)
Original comment:
As I understand it, you want to filter the imported dataframe according to some criteria. You can work directly on the pandas you imported. Look:
# criteria:
verify = df['CompteNum'].apply(lambda number: len(str(number)) > 7)
# filtering the dataframe based on the given criteria:
df[verify] # meets the criteria
df[~verify] # does not meet the criteria
# saving the dataframes:
df[verify].to_csv('<your path>/meets-criterios.csv')
df[~verify].to_csv('<your path>/not-meets-criterios.csv')
Once you have the filtered dataframes, you can save them or convert them to other objects, such as dictionaries.

How to match asset price data from a csv file to another csv file with relevant news by date

I am researching the impact of news article sentiment related to a financial instrument and its potenatial effect on its instruments's price. I have tried to get the timestamp of each news item, truncate it to minute data (ie remove second and microsecond components) and get the base shareprice of an instrument at that time, and at several itervals after that time, in our case t+2. However, program created twoM to the file, but does not return any calculated price changes
Previously, I used Reuters Eikon and its functions to conduct the research, described in the article below.
https://developers.refinitiv.com/article/introduction-news-sentiment-analysis-eikon-data-apis-python-example
However, instead of using data available from Eikon, I would like to use my own csv news file with my own price data from another csv file. I am trying to match the
excel_file = 'C:\\Users\\Artur\\PycharmProjects\\JRA\\sentimenteikonexcel.xlsx'
df = pd.read_excel(excel_file)
sentiment = df.Sentiment
print(sentiment)
start = df['GMT'].min().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
end = df['GMT'].max().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
spot_data = 'C:\\Users\\Artur\\Desktop\\stocksss.csv'
spot_price_10 = pd.read_csv(spot_data)
print(spot_price_10)
df['twoM'] = np.nan
for idx, newsDate in enumerate(df['GMT'].values):
sTime = df['GMT'][idx]
sTime = sTime.replace(second=0, microsecond=0)
try:
t0 = spot_price_10.iloc[spot_price_10.index.get_loc(sTime),2]
df['twoM'][idx] = ((spot_price_10.iloc[spot_price_10.index.get_loc((sTime + datetime.timedelta(minutes=10))),3]/(t0)-1)*100)
except:
pass
print(df)
However, the programm is not able to return the twoM price change values
I assume that you got a warning because you are trying to make changes on views. As soon as you have 2 [] (one for the column, one for the row) you can only read. You must use loc or iloc to write a value:
...
try:
t0 = spot_price_10.iloc[spot_price_10.index.get_loc(sTime),2]
df.loc[idx,'twoM'] = ((spot_price_10.iloc[spot_price_10.index.get_loc((sTime + datetime.timedelta(minutes=10))),3]/(t0)-1)*100)
except:
pass
...

Python: Remove all lines of csv files that do not match within a range

I have been given two sets of data in the form of csv files which have 23 columns and thousands of lines of data.
The data in column 14 corresponds to the positions of stars in an image of a galaxy.
The issue is that one set of data contains values for positions that do not exist in the second set of data. They need to both contain the same positions, but the positions are off by a value of 0.0002 each data set.
F435.csv has values which are 0.0002 greater than the values in F550.csv. I am trying to find the matches between the two files, but within a certain range because all values are off by a certain amount.
Then, I need to delete all lines of data that correspond to values that do not match.
Below is a sample of the data from each of the two files:
F435W.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE
1,2017.013,0.01242859,-8.2618,0,51434.12,0.3269918,-11.7781,0,0.01957931,1387.9406,541.916,49.9898514,41.5266996,8.81E+01,1.63E+03,1.44E+02,40.535,8.65,84.72,0.00061,0.00035,62.14
2,84.73392,0.01245409,-4.8201,0.0002,112.9723,0.04012135,-5.1324,0.0004,-0.002142646,150.306,146.7986,49.9942613,41.5444109,4.92E+00,5.60E+00,-2.02E-01,2.379,2.206,-74.69,0.00339,0.0029,88.88
3,215.1939,0.01242859,-5.8321,0.0001,262.2751,0.03840466,-6.0469,0.0002,-0.002961465,3248.686,52.8478,50.003155,41.5019044,4.77E+00,5.05E+00,-1.63E-01,2.263,2.166,-65.29,0.002,0.0019,-66.78
4,0.3796681,0.01240305,1.0515,0.0355,0.5823653,0.05487975,0.587,0.1023,-0.00425157,3760.344,11.113,50.0051049,41.4949256,1.93E+00,1.02E+00,-7.42E-02,1.393,1.007,-4.61,0.05461,0.03818,-6.68
5,0.9584663,0.01249223,0.0461,0.0142,1.043696,0.0175857,-0.0464,0.0183,-0.004156116,4013.2063,9.1225,50.0057256,41.4914444,1.12E+00,9.75E-01,1.09E-01,1.085,0.957,28.34,0.01934,0.01745,44.01
F550M.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE,,FALSE
2,1921.566,0.01258874,-8.2091,0,37128.06,0.2618096,-11.4243,0,0.01455503,4617.5225,554.576,49.9887896,41.5264699,6.09E+01,8.09E+02,1.78E+01,28.459,7.779,88.63,0.00054,0.00036,77.04,,
3,1.055918,0.01256313,-0.0591,0.0129,9.834856,0.1109255,-2.4819,0.0122,-0.002955142,3936.4946,85.3255,49.9949149,41.5370016,3.98E+01,1.23E+01,1.54E+01,6.83,2.336,24.13,0.06362,0.01965,23.98,,
4,151.2355,0.01260153,-5.4491,0.0001,184.0693,0.03634057,-5.6625,0.0002,-0.002626019,3409.2642,76.9891,49.9931935,41.5442109,4.02E+00,4.35E+00,-1.47E-03,2.086,2.005,-89.75,0.00227,0.00198,66.61,,
5,0.3506025,0.01258874,1.138,0.039,0.3466277,0.01300407,1.1503,0.0407,-0.002441164,3351.9893,8.9147,49.9942299,41.5451727,4.97E-01,5.07E-01,7.21E-03,0.715,0.702,62.75,0.02,0.01989,82.88
Below is the code I have so far, but I'm unsure how to find matches based on that specific column. I am very new to Python, and this task is probably way beyond my knowledge of Python, but I desperately need to figure it out. I've been working on this single task for weeks, trying different methods. Thank you in advance!
import csv
with open('F435W.csv') as csvF435:
readCSV = csv.reader(csvF435, delimiter=',')
with open('F550M.csv') as csvF550:
readCSV = csv.reader(csvF550, delimiter=',')
for x in range (0,6348):
a = csvF435[x]
for y in range(0,6349):
b = csvF550[y]
if b < a + 0.0002 and b > a - 0.0002:
newlist.append(b)
break
You can use the following sample:
import csv
def isfloat(value):
try:
float(value)
return True
except ValueError:
return False
interval = 0.0002
with open('F435W.csv') as csvF435:
csvF435_in = csv.reader(csvF435, delimiter=',')
#clean the file content before processing
with open("merge.csv","w") as merge_out:
pass
with open("merge.csv", "a") as merge_out:
#write the header of the output csv file
for header in csvF435_in:
merge_out.write(','.join(header)+'\n')
break
for l435 in csvF435_in:
with open('F550M.csv') as csvF550:
csvF550_in = csv.reader(csvF550, delimiter=',')
for l550 in csvF550_in:
if isfloat(l435[13]) and isfloat(l550[13]) and abs(float(l435[13])-float(l550[13])) < interval:
merge_out.write(','.join(l435)+'\n')
F435W.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE
1,2017.013,0.01242859,-8.2618,0,51434.12,0.3269918,-11.7781,0,0.01957931,1387.9406,541.916,49.9898514,41.5266996,8.81E+01,1.63E+03,1.44E+02,40.535,8.65,84.72,0.00061,0.00035,62.14
2,84.73392,0.01245409,-4.8201,0.0002,112.9723,0.04012135,-5.1324,0.0004,-0.002142646,150.306,146.7986,49.9942613,41.5444109,4.92E+00,5.60E+00,-2.02E-01,2.379,2.206,-74.69,0.00339,0.0029,88.88
3,215.1939,0.01242859,-5.8321,0.0001,262.2751,0.03840466,-6.0469,0.0002,-0.002961465,3248.686,52.8478,50.003155,41.5019044,4.77E+00,5.05E+00,-1.63E-01,2.263,2.166,-65.29,0.002,0.0019,-66.78
4,0.3796681,0.01240305,1.0515,0.0355,0.5823653,0.05487975,0.587,0.1023,-0.00425157,3760.344,11.113,50.0051049,41.4949256,1.93E+00,1.02E+00,-7.42E-02,1.393,1.007,-4.61,0.05461,0.03818,-6.68
5,0.9584663,0.01249223,0.0461,0.0142,1.043696,0.0175857,-0.0464,0.0183,-0.004156116,4013.2063,9.1225,50.0057256,41.4914444,1.12E+00,9.75E-01,1.09E-01,1.085,0.957,28.34,0.01934,0.01745,44.01
F550M.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE,,FALSE
2,1921.566,0.01258874,-8.2091,0,37128.06,0.2618096,-11.4243,0,0.01455503,4617.5225,554.576,49.9887896,41.5264699,6.09E+01,8.09E+02,1.78E+01,28.459,7.779,88.63,0.00054,0.00036,77.04,,
3,1.055918,0.01256313,-0.0591,0.0129,9.834856,0.1109255,-2.4819,0.0122,-0.002955142,3936.4946,85.3255,49.9949149,41.5370016,3.98E+01,1.23E+01,1.54E+01,6.83,2.336,24.13,0.06362,0.01965,23.98,,
4,151.2355,0.01260153,-5.4491,0.0001,184.0693,0.03634057,-5.6625,0.0002,-0.002626019,3409.2642,76.9891,49.9931935,41.5442109,4.02E+00,4.35E+00,-1.47E-03,2.086,2.005,-89.75,0.00227,0.00198,66.61,,
5,0.3506025,0.01258874,1.138,0.039,0.3466277,0.01300407,1.1503,0.0407,-0.002441164,3351.9893,8.9147,49.9942299,41.5451727,4.97E-01,5.07E-01,7.21E-03,0.715,0.702,62.75,0.02,0.01989,82.88
merge.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE
2,84.73392,0.01245409,-4.8201,0.0002,112.9723,0.04012135,-5.1324,0.0004,-0.002142646,150.306,146.7986,49.9942613,41.5444109,4.92E+00,5.60E+00,-2.02E-01,2.379,2.206,-74.69,0.00339,0.0029,88.88

Convert rows of text into pandas structure

I have thousands of rows in a list like the one below that I would like to convert into a pandas table consisting of different columns.
2018-12-03 21:15:24 Sales:120 ID:534343 North America
2018-12-03 21:15:27 Sales:65 ID:534344 Europe
Ideally I would like to to create a pandas structure with the following columns: Date, Sale, ID, Region, and then fill it with values that fit the values.
E.g. so in the first row I have sales = 120, ID = 534343, region = North America and date = 2018-12-03 21:15:24.
Given that I have thousands of rows, what code could make this work?
Supposing your list is in a file, read it first into a string (or into a list already, in which case following code will differ) and then apply code.
To read into a string:
with open('/file/path/myfile.txt','r') as f:
s = f.read()
Code for parsing:
import re
import pandas as pd
s = """2018-12-03 21:15:24 Sales:120 ID:534343 North America
2018-12-03 21:15:27 Sales:65 ID:534344 Europe"""
sales_re = "Sales:([0-9]+)"
id_re = "ID:([0-9]+)"
lst = []
for line in s.split('\n'):
date = line[0:19]
sale = re.search(sales_re, line).groups()[0]
id = re.search(id_re, line).groups()[0]
region = line[line.rfind(":")+1+len(id)+1:] # Search from last ":", add one to go over ":" and 1 to skip space
x = [date, sale, id, region]
lst.append(x)
df = pd.DataFrame(lst)
df.columns = ['date', 'sale', 'id', 'region']
In the example above, I assume everything is loaded into a string. Then I use regular expressions to extract harder part of each line and append everything into a list that. Then I use the pandas.DataFrame constructor to convert into a dataframe.

Python data wrangling issues

I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain categorical values (such as Prize_Pool) results in python considering these entries as strings. I need to convert these to floats in order to make certain calculations. I've used python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use thousands=',' argument for numbers that contain a comma
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check Prize_Pool is numerical
In [3]: type(d.ix[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop rows - take first observed, you can also take last
In [7]: d.drop_duplicates('Contest_Date_EST', take_last=False)
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
Edit: Just realized you're using pandas - should have looked at that.
I'll leave this here for now in case it's applicable but if it gets
downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight
Seems like itertools.groupby() is the tool for this job;
Something like this?
import csv
import itertools
class CsvImport():
def Run(self, filename):
# Get the formatted rows from CSV file
rows = self.readCsv(filename)
for key in rows.keys():
print "\nKey: " + key
i = 1
for value in rows[key]:
print "\nValue {index} : {value}".format(index = i, value = value)
i += 1
def readCsv(self, fileName):
with open(fileName, 'rU') as csvfile:
reader = csv.DictReader(csvfile)
# Keys may or may not be pulled in with extra space by DictReader()
# The next line simply creates a small dict of stripped keys to original padded keys
keys = { key.strip(): key for (key) in reader.fieldnames }
# Format each row into the final string
groupedRows = {}
for k, g in itertools.groupby(reader, lambda x : x["Contest_Date_EST"]):
groupedRows[k] = [self.normalizeRow(v.values()) for v in g]
return groupedRows;
def normalizeRow(self, row):
row[1] = float(row[1].replace(',','')) # "Prize_Pool"
# and so on
return row
if __name__ == "__main__":
CsvImport().Run("./Test1.csv")
Output:
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)

Categories