Trying to read a .txt file into my Jupyter notebook.
This is my code:
acm = pd.read_csv('outputacm.txt', header=None, error_bad_lines=False)
print(acm)
Here is a sample of my txt file:
2244018
#*OQL[C++]: Extending C++ with an Object Query Capability.
##José A. Blakeley
#year1995
#confModern Database Systems
#citation14
#index0
#arnetid2
#*Transaction Management in Multidatabase Systems.
##Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#year1995
#confModern Database Systems
#citation22
#index1
#arnetid3
#*Overview of the ADDS System.
##Yuri Breitbart,Tom C. Reyes
#year1995
#confModern Database Systems
#citation-1
#index2
#arnetid4
And the different symbols are supposed to correspond to:
#* --- paperTitle
## --- Authors
#year ---- Year
#conf --- publication venue
#citation --- citation number (both -1 and 0 means none)
#index ---- index id of this paper
#arnetid ---- pid in arnetminer database
#% ---- the id of references of this paper (there are multiple lines, with each indicating a reference)
#! --- Abstract
Not sure how to set this up so the data gets read correctly. Ideally, I would want a data frame where each category is a different column, and then all the entries in the document are rows. Thanks!
My regex is not as up to speed as it should be, but the below might work so long as the data remains in the same form and the column names aren't duplicated in other lines:
import re
import pandas as pd

path = r"filepath.txt"
year = []
conf = []
# continue for all columns
with open(path, 'r') as f:
    for ele in f:
        ele = ele.strip()
        if re.match(r'#year', ele):
            year.append(ele[5:])   # drop the '#year' prefix
        elif re.match(r'#conf', ele):
            conf.append(ele[5:])   # drop the '#conf' prefix; 'Modern ...' is part of the value
        # continue for all columns with the needed prefix
# every list must end up with the same length, i.e. every paper needs every field
df = pd.DataFrame(data={'year': year, 'conf': conf})  # continue for each list
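Separate lists break as soon as one paper lacks a field, so a record-based variant may be more robust. The following is only a sketch: it assumes every paper starts with a `#*` title line, and the column names (`title`, `venue`, and so on) are illustrative choices rather than anything defined by the data set.

```python
import pandas as pd

# (prefix, column) pairs taken from the legend in the question; the column
# names on the right are illustrative choices, not part of the data set.
PREFIXES = [
    ("#*", "title"), ("##", "authors"), ("#year", "year"), ("#conf", "venue"),
    ("#citation", "citation"), ("#index", "index"), ("#arnetid", "arnetid"),
    ("#%", "references"), ("#!", "abstract"),
]

def parse_papers(lines):
    """Turn the '#'-prefixed lines into a DataFrame with one row per paper."""
    records, current = [], {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("#*") and current:   # a title line starts a new paper
            records.append(current)
            current = {}
        for prefix, column in PREFIXES:
            if line.startswith(prefix):
                value = line[len(prefix):]
                if column == "references":      # '#%' can occur several times
                    current.setdefault(column, []).append(value)
                else:
                    current[column] = value
                break
    if current:
        records.append(current)
    return pd.DataFrame(records)

# First two records of the sample from the question.
sample = """2244018
#*OQL[C++]: Extending C++ with an Object Query Capability.
##José A. Blakeley
#year1995
#confModern Database Systems
#citation14
#index0
#arnetid2
#*Transaction Management in Multidatabase Systems.
##Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#year1995
#confModern Database Systems
#citation22
#index1
#arnetid3
""".splitlines()

papers = parse_papers(sample)
```

For the real file, an open file handle can be passed instead of the sample list, since the function only iterates over lines. Lines without a known prefix, such as the leading record count, are skipped.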
I have to work on a flat file (size > 500 MB) and I need to split it into files based on one criterion.
My original file has this structure (simplified):
JournalCode|JournalLib|EcritureNum|EcritureDate|CompteNum|
I need to create two files depending on the first digit of 'CompteNum'.
I have started my code as follows:
import sys
import pandas as pd
import numpy as np
import datetime
C_FILE_SEP = "|"
def main(fic):
    pd.options.display.float_format = '{:,.2f}'.format
    FileFec = pd.read_csv(fic, sep=C_FILE_SEP, encoding='unicode_escape')
That part seems OK; my problem is creating the two files based on the criterion. I have tried the following without success:
    TargetFec = 'Target_'+fic+datetime.datetime.now().strftime("%Y%m%d-%H%M%S")+'.txt'
    target = open(TargetFec, 'w')
    FileFec = FileFec.astype(convert_dict)
    for row in FileFec.iterrows():
        Fec_Cpt = str(FileFec['CompteNum'])
        nb = len(Fec_Cpt)
        if (nb > 7):
            target.write(str(row))
    target.close()
The content of my target file is not what I expected:
(0, JournalCode OUVERT
JournalLib JOURNAL D'OUVERTURE
EcritureNum XXXXXXXXXX
EcritureDate 20190101
CompteNum 101300
CompteLib CAPITAL SOUSCRIT
CompAuxNum
CompAuxLib
PieceRef XXXXXXXXXX
PieceDate 20190101
EcritureLib A NOUVEAU
Debit 000000000000,00
Credit 000038188458,00
EcritureLet NaN
DateLet NaN
ValidDate 20190101
Montantdevise
Idevise
CodeEtbt 100
Unnamed: 19 NaN
I expected a line to be written to my target file only when the first digit of CompteNum is greater than 7.
I have been reading posts for two days without finding a solution, so any help would be much appreciated.
There is a sample of my data available here
Philippe
To match the rule and the desired output format, you can use logic like:
# criteria:
verify = df['CompteNum'].apply(lambda number: str(number)[0] == '8' or str(number)[0] == '9')
# saving the dataframes:
df[verify].to_csv('c:/users/jack/desktop/meets-criterios.csv', sep = '|', index = False)
Original comment:
As I understand it, you want to filter the imported dataframe according to some criteria. You can work directly on the DataFrame you imported:
# criteria:
verify = df['CompteNum'].apply(lambda number: len(str(number)) > 7)
# filtering the dataframe based on the given criteria:
df[verify] # meets the criteria
df[~verify] # does not meet the criteria
# saving the dataframes:
df[verify].to_csv('<your path>/meets-criterios.csv')
df[~verify].to_csv('<your path>/not-meets-criterios.csv')
Once you have the filtered dataframes, you can save them or convert them to other objects, such as dictionaries.
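Since the file is larger than 500 MB, it may be worth streaming it in chunks instead of loading it all at once. The sketch below is one way to do that under a few assumptions: the function name and its parameters are hypothetical, the file is read with `dtype=str` so that values such as `000000000000,00` are written back unchanged, and the rule is "first character of `CompteNum` is 8 or 9".

```python
import io
import pandas as pd

def split_by_first_digit(src, hit_out, miss_out, digits=("8", "9"),
                         column="CompteNum", sep="|", chunksize=100_000):
    """Stream src in chunks: rows whose `column` starts with one of `digits`
    go to hit_out, all other rows go to miss_out."""
    reader = pd.read_csv(src, sep=sep, dtype=str, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        mask = chunk[column].str[0].isin(digits)
        # write the header only once, with the first chunk
        chunk[mask].to_csv(hit_out, sep=sep, index=False, header=(i == 0))
        chunk[~mask].to_csv(miss_out, sep=sep, index=False, header=(i == 0))

# Tiny in-memory demonstration with made-up rows.
sample = io.StringIO("JournalCode|CompteNum\nOUVERT|101300\nVT|801000\nVT|912000\n")
hits, misses = io.StringIO(), io.StringIO()
split_by_first_digit(sample, hits, misses, chunksize=2)
```

With real files, the two outputs would be handles opened with `open(path, "w", newline="")`, so each chunk is appended to the same file.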
I have a CSV file containing daily data on yields of different government bonds of varying maturities. The headers are formatted as the country followed by the maturity of the bond, e.g. UK 10Y. What I would like to do is import all the yields for one government bond at all maturities for one date, for example all the UK government bond yields on a particular date. The first date is 07/01/2021.
I know I can use pandas, but all the code I have seen relies on the usecols parameter when importing. I'd like to write a function that imports only the data I want without using usecols.
Snapshot of data, UK data is further right, but format is the same
You can try:
import time
import datetime

col_to_check = "UK government bond yields"
get_after = "07/01/2021"
get_after = time.mktime(datetime.datetime.strptime(get_after, "%d/%m/%Y").timetuple())
with open("yourfile.csv", "r") as msg:
    data = msg.readlines()
index_to_check = data[0].split(",").index(col_to_check)
for i, v in enumerate(data):
    if i == 0:
        pass
    else:
        date = time.mktime(datetime.datetime.strptime(v.split(",")[index_to_check], "%d/%m/%Y").timetuple())
        if date > get_after:
            pass
        else:
            data[i] = ""
print([x for x in data if x])
This code is untested, as you did not provide a sample input, but in principle it should work.
You supply the header name of the column you want to check and the limit date.
The code finds the index of that header in the first row of your CSV and converts the limit date to a timestamp.
It then reads the data line by line: if the row's timestamp is greater than the limit the row is kept, otherwise the row is replaced with an empty string.
Finally, the empty elements are removed to get the final list.
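If pandas is acceptable as long as `usecols` is not needed, another option is to read the whole file and select the columns by their header prefix afterwards. This is a sketch under assumptions the question does not confirm: the date column is called `Date`, dates are formatted as `dd/mm/yyyy`, each date appears once, and the function name is hypothetical. The numbers in the sample are made up for illustration.

```python
import io
import pandas as pd

def yields_for(csv_source, country, date, date_col="Date"):
    """Return all columns whose header starts with `country` for one date."""
    df = pd.read_csv(csv_source, index_col=date_col)
    df.index = pd.to_datetime(df.index, format="%d/%m/%Y")
    cols = [c for c in df.columns if c.startswith(country + " ")]
    return df.loc[pd.Timestamp(date), cols]

# Made-up sample in the header style described in the question.
sample = io.StringIO(
    "Date,US 2Y,US 10Y,UK 2Y,UK 10Y\n"
    "07/01/2021,0.14,1.08,-0.12,0.29\n"
    "08/01/2021,0.14,1.12,-0.13,0.29\n"
)
uk = yields_for(sample, "UK", "2021-01-07")
```

The result is a Series indexed by the matching headers (`UK 2Y`, `UK 10Y`), so individual maturities can be read with `uk["UK 10Y"]`.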
I am researching the impact of news article sentiment related to a financial instrument and its potential effect on the instrument's price. I have tried to get the timestamp of each news item, truncate it to the minute (i.e. remove the second and microsecond components), and get the base share price of the instrument at that time and at several intervals after that time, in our case t+2. However, the program adds the twoM column to the output but does not return any calculated price changes.
Previously, I used Reuters Eikon and its functions to conduct the research, described in the article below.
https://developers.refinitiv.com/article/introduction-news-sentiment-analysis-eikon-data-apis-python-example
However, instead of using data available from Eikon, I would like to use my own csv news file with my own price data from another csv file. I am trying to match the
excel_file = 'C:\\Users\\Artur\\PycharmProjects\\JRA\\sentimenteikonexcel.xlsx'
df = pd.read_excel(excel_file)
sentiment = df.Sentiment
print(sentiment)
start = df['GMT'].min().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
end = df['GMT'].max().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
spot_data = 'C:\\Users\\Artur\\Desktop\\stocksss.csv'
spot_price_10 = pd.read_csv(spot_data)
print(spot_price_10)
df['twoM'] = np.nan
for idx, newsDate in enumerate(df['GMT'].values):
    sTime = df['GMT'][idx]
    sTime = sTime.replace(second=0, microsecond=0)
    try:
        t0 = spot_price_10.iloc[spot_price_10.index.get_loc(sTime),2]
        df['twoM'][idx] = ((spot_price_10.iloc[spot_price_10.index.get_loc((sTime + datetime.timedelta(minutes=10))),3]/(t0)-1)*100)
    except:
        pass
print(df)
However, the program does not return the twoM price change values.
I assume that you got a warning because you are trying to modify a view. With chained indexing, i.e. two sets of [] (one for the column, one for the row), the assignment may be applied to a temporary copy instead of df itself, so it is only reliable for reading. You must use loc or iloc to write a value:
...
    try:
        t0 = spot_price_10.iloc[spot_price_10.index.get_loc(sTime),2]
        df.loc[idx,'twoM'] = ((spot_price_10.iloc[spot_price_10.index.get_loc((sTime + datetime.timedelta(minutes=10))),3]/(t0)-1)*100)
    except:
        pass
...
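A bare `except: pass` also hides lookup failures, for example when the price table is not indexed by timestamp, so it can help to catch only `KeyError`. The sketch below illustrates the `loc` assignment together with that narrower exception handling. The frames, the `close` column name and all values are made up for the example; only `GMT` and `twoM` come from the question.

```python
import pandas as pd

# Hypothetical minute bars indexed by timestamp (made-up values).
prices = pd.DataFrame(
    {"close": [100.0, 101.0, 102.0]},
    index=pd.to_datetime(
        ["2021-01-04 09:30", "2021-01-04 09:32", "2021-01-04 09:40"]
    ),
)
news = pd.DataFrame(
    {"GMT": pd.to_datetime(["2021-01-04 09:30:17", "2021-01-04 11:00:05"])}
)
news["twoM"] = float("nan")

for idx, stamp in news["GMT"].items():
    t = stamp.floor("min")                 # drop seconds and microseconds
    try:
        p0 = prices.at[t, "close"]
        p1 = prices.at[t + pd.Timedelta(minutes=2), "close"]
    except KeyError:
        continue                           # no bar at one of the two times
    news.loc[idx, "twoM"] = (p1 / p0 - 1) * 100   # write through .loc, not news['twoM'][idx]
```

Here the first news item gets a two-minute change of about 1 percent, while the second stays NaN because no price bar exists at that time.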
I am trying to parse data from CSV files. The files are in a folder and I want to extract data and write them to the db. However the csvs are not set up in a table format. I know how to import csvs into the db with the for each loop container, adding data flow tasks, and importing with OLE DB Destination.
The problem is just getting one value out of these CSVs. The format of the file is as follows:
Title Title 2
Date saved ##/##/#### ##:## AM
Comment
[ Main ]
No. Measure Output Unit of measure
1 Name 8 µm
Count 0 pcs
[ XY Measure ]
X
Y
D
[ Area ]
No. Area Unit Perimeter Unit
All I want is the output value, which is "8", plus the name of the file (to use as the name of the result or to add to a column) and the date and time, each in its own column.
I am not sure which direction to take and I hope someone has some suggestions for me to look into. Originally, I wasn't sure whether I should do the parsing externally (in Python) before using SQL Server. If anyone knows another way to get this done, please let me know. Sorry for the unclear post earlier.
The expected outcome:
Filename Date Time Outcome
jnnnnnnn ##/##/#### ##:## 8
I'd try this:
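import os

def parse_measure_file(path):
    # file name without directory or extension, taken from the path being parsed
    filename = os.path.splitext(os.path.basename(path))[0]
    date_saved = time_saved = outcome = None
    with open(path, encoding='utf-8') as csv_file:   # the encoding is an assumption
        for row in csv_file:
            fields = row.split()
            if row.startswith('Date saved'):
                # e.g. "Date saved ##/##/#### ##:## AM"
                date_saved = fields[2]
                time_saved = fields[3]
            elif outcome is None and (u"\u00B5" in row or u"\u03BC" in row):
                # e.g. "1 Name 8 µm": the value sits just before the unit
                outcome = fields[-2]
    # add filename, date_saved, time_saved, outcome to the data that will go in the DB
    return filename, date_saved, time_saved, outcome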
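The same extraction can be sketched with regular expressions applied to the whole file content. This assumes the fields are separated by spaces or tabs, as in the sample shown in the question, and that the unit of the wanted row starts with a micro sign. The function name is hypothetical and the date and time in the sample are made up, since the question only shows placeholders.

```python
import re

# "Date saved <date> <time> ..." on its own line.
DATE_RE = re.compile(r"^Date saved[ \t]+(\S+)[ \t]+(\S+)", re.MULTILINE)
# A row of the [ Main ] table: number, name, value, then a unit starting
# with a micro sign (U+00B5) or a Greek mu (U+03BC).
VALUE_RE = re.compile(
    r"^\d+[ \t]+.+?[ \t]+(\S+)[ \t]+[\u00b5\u03bc]\w*[ \t]*$", re.MULTILINE
)

def extract(text, filename):
    """Return the four columns of the expected outcome as a dict."""
    date = DATE_RE.search(text)
    value = VALUE_RE.search(text)
    return {
        "Filename": filename,
        "Date": date.group(1) if date else None,
        "Time": date.group(2) if date else None,
        "Outcome": value.group(1) if value else None,
    }

# Sample in the layout from the question; the date and time are made up.
sample = (
    "Title Title 2\n"
    "Date saved 01/02/2020 10:15 AM\n"
    "Comment\n"
    "[ Main ]\n"
    "No. Measure Output Unit of measure\n"
    "1 Name 8 \u00b5m\n"
    "Count 0 pcs\n"
)
row = extract(sample, "jnnnnnnn")
```

Each resulting dict corresponds to one row of the target table, so a loop over the folder could collect them into a list before inserting into the database.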
I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain categorical values (such as Prize_Pool) results in python considering these entries as strings. I need to convert these to floats in order to make certain calculations. I've used python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use thousands=',' argument for numbers that contain a comma
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check Prize_Pool is numerical
In [3]: type(d.loc[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop duplicate rows, keeping the first occurrence (use keep='last' to keep the last one instead):
In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
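As a self-contained check of both steps, here is a small sketch that feeds the three lines from the question through `read_csv` from an in-memory string. The variable names are arbitrary, and `parse_dates` is an optional addition that is not required for the de-duplication itself.

```python
import io
import pandas as pd

# The header and the two data rows quoted in the question.
raw = '''"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
'''

d = pd.read_csv(io.StringIO(raw), thousands=",", parse_dates=["Contest_Date_EST"])

# keep='first' or keep='last' chooses which of the repeated rows survives
first = d.drop_duplicates("Contest_Date_EST", keep="first")
last = d.drop_duplicates("Contest_Date_EST", keep="last")
```

Passing `keep=False` instead would drop every row whose timestamp is repeated, which leaves only timestamps that occur exactly once.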
Edit: Just realized you're using pandas - should have looked at that.
I'll leave this here for now in case it's applicable but if it gets
downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight
Seems like itertools.groupby() is the tool for this job;
Something like this?
import csv
import itertools

class CsvImport():

    def Run(self, filename):
        # Get the formatted rows from CSV file
        rows = self.readCsv(filename)
        for key in rows.keys():
            print("\nKey: " + key)
            for i, value in enumerate(rows[key], start=1):
                print("\nValue {index} : {value}".format(index=i, value=value))

    def readCsv(self, fileName):
        with open(fileName, 'r', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may or may not be pulled in with extra space by DictReader()
            # The next line simply creates a small dict of stripped keys to original padded keys
            self.keys = {key.strip(): key for key in reader.fieldnames}
            date_key = self.keys["Contest_Date_EST"]
            # groupby() only groups consecutive rows, so sort by the key first
            rows = sorted(reader, key=lambda x: x[date_key])
            # Group the normalized rows by timestamp
            groupedRows = {}
            for k, g in itertools.groupby(rows, lambda x: x[date_key]):
                groupedRows[k] = [self.normalizeRow(v) for v in g]
        return groupedRows

    def normalizeRow(self, row):
        pool_key = self.keys["Prize_Pool"]
        row[pool_key] = float(row[pool_key].replace(',', ''))
        # and so on
        return row

if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
Output:
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
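One detail worth keeping in mind with itertools.groupby() is that it only merges adjacent items with equal keys, so the rows need to be sorted by the grouping key beforehand. A minimal illustration, using the timestamp from the question plus one made-up second timestamp:

```python
import itertools

# One timestamp from the question, repeated, with a made-up one in between.
stamps = ["2015-03-01 13:00:00", "2015-03-02 19:00:00", "2015-03-01 13:00:00"]

# Without sorting, the repeated timestamp forms two separate groups.
unsorted_keys = [key for key, _ in itertools.groupby(stamps)]

# After sorting, equal timestamps are adjacent and end up in one group.
sorted_keys = [key for key, _ in itertools.groupby(sorted(stamps))]
```

The unsorted input yields three groups and the sorted input yields two, which is why sorting first matters when a dict keyed by timestamp is being built.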