I have to work on a flat file (size > 500 MB) and I need to split it into two files based on one criterion.
My original file has this structure (simplified):
JournalCode|JournalLib|EcritureNum|EcritureDate|CompteNum|
I need to create two files depending on the first digit of 'CompteNum'.
I have started my code as follows:
import sys
import pandas as pd
import numpy as np
import datetime
C_FILE_SEP = "|"
def main(fic):
    pd.options.display.float_format = '{:,.2f}'.format
    FileFec = pd.read_csv(fic, sep=C_FILE_SEP, encoding='unicode_escape')
It seems OK; my concern is creating my two files based on the criterion. I have tried without success:
    TargetFec = 'Target_' + fic + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.txt'
    target = open(TargetFec, 'w')
    FileFec = FileFec.astype(convert_dict)
    for row in FileFec.iterrows():
        Fec_Cpt = str(FileFec['CompteNum'])
        nb = len(Fec_Cpt)
        if (nb > 7):
            target.write(str(row))
    target.close()
The result in my target file is not what I expected:
(0, JournalCode OUVERT
JournalLib JOURNAL D'OUVERTURE
EcritureNum XXXXXXXXXX
EcritureDate 20190101
CompteNum 101300
CompteLib CAPITAL SOUSCRIT
CompAuxNum
CompAuxLib
PieceRef XXXXXXXXXX
PieceDate 20190101
EcritureLib A NOUVEAU
Debit 000000000000,00
Credit 000038188458,00
EcritureLet NaN
DateLet NaN
ValidDate 20190101
Montantdevise
Idevise
CodeEtbt 100
Unnamed: 19 NaN
I expected to obtain a line in my target file only when CompteNum[0:1] > 7, i.e. when the first digit is 8 or 9.
I have been reading posts for two days; some help would be perfect.
There is a sample of my data available here
Philippe
Following the rules and the desired format, you can use logic like:
# criteria:
verify = df['CompteNum'].apply(lambda number: str(number)[0] in ('8', '9'))
# saving the dataframes:
df[verify].to_csv('c:/users/jack/desktop/meets-criterios.csv', sep='|', index=False)
Original comment:
As I understand it, you want to filter the imported dataframe according to some criteria. You can work directly on the DataFrame you imported. Look:
# criteria:
verify = df['CompteNum'].apply(lambda number: len(str(number)) > 7)
# filtering the dataframe based on the given criteria:
df[verify] # meets the criteria
df[~verify] # does not meet the criteria
# saving the dataframes:
df[verify].to_csv('<your path>/meets-criterios.csv')
df[~verify].to_csv('<your path>/not-meets-criterios.csv')
Once you have the filtered dataframes, you can save them or convert them to other objects, such as dictionaries.
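Applying the same idea to the original question (keep rows whose first digit of CompteNum is strictly greater than 7), here is a minimal sketch that writes both files; the 'Other_' prefix and the hard-coded input name are my assumptions, not from the post:

import datetime
import pandas as pd

C_FILE_SEP = "|"
fic = 'myfile.txt'  # hypothetical input file name

FileFec = pd.read_csv(fic, sep=C_FILE_SEP, encoding='unicode_escape')

# First digit of CompteNum strictly greater than 7, i.e. '8' or '9'
verify = FileFec['CompteNum'].astype(str).str[0].astype(int) > 7

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
FileFec[verify].to_csv('Target_' + fic + stamp + '.txt', sep=C_FILE_SEP, index=False)
FileFec[~verify].to_csv('Other_' + fic + stamp + '.txt', sep=C_FILE_SEP, index=False)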
Related
I have a CSV file containing daily data on yields of different government bonds of varying maturities. The headers are formatted as the country followed by the maturity of the bond, e.g. UK 10Y. What I would like to do is import all the yields for one government bond at all maturities for one date, for example all the UK government bond yields at a particular date. The first date is 07/01/2021.
I know I can use pandas, but all the code I have seen requires the usecols argument when importing. I'd like to just create a function and import only the data that I want without using usecols.
Snapshot of data: the UK data is further to the right, but the format is the same.
You can try:
import time
import datetime

col_to_check = "UK government bond yields"
get_after = "07/01/2021"
get_after = time.mktime(datetime.datetime.strptime(get_after, "%d/%m/%Y").timetuple())

with open("yourfile.csv", "r") as msg:
    data = msg.readlines()

index_to_check = data[0].split(",").index(col_to_check)

for i, v in enumerate(data):
    if i == 0:
        pass
    else:
        date = time.mktime(datetime.datetime.strptime(v.split(",")[index_to_check], "%d/%m/%Y").timetuple())
        if date > get_after:
            pass
        else:
            data[i] = ""

print([x for x in data if x])
This is untested code, as you did not provide a sample input, but in principle it should work.
You have the header name of the column you want to check and the limit date.
You get the index of that column in the first (header) row of your CSV, and you convert the limit date to a timestamp integer.
Then you read your data line by line and check each row: if the date/timestamp is greater than your limit you pass; otherwise you assign an empty value at the corresponding index of data.
Finally you remove the empty elements to get the final list.
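For comparison, here is a pandas sketch of the same filtering without usecols; the file name and the assumption that the dates live in a dedicated 'Date' column are mine, not from the question:

import pandas as pd

df = pd.read_csv('yourfile.csv', parse_dates=['Date'], dayfirst=True)  # hypothetical 'Date' column
kept = df[df['Date'] > '2021-01-07']  # rows strictly after 07/01/2021
print(kept['UK 10Y'])  # all rows for one bond after that date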
I would like to randomly select characters, without replacement, from this list of symbols: '#','+','?','!','$','*','%','#','}','>','&','^'.
The generated string should be equal in length to the word in another column of the CSV.
Example of an existing csv:
Word Length
dog 3
wolf 4
cactus 6
bus 3
I would like the code to append a third column to the existing CSV file, containing the generated string of matching length for each word. This is an example of the result I want:
Word Length String
dog 3 #!#
wolf 4 &*%!
cactus 6 ^?!##%
bus 3 }&^
This is the code I tried but I do not think it is right.
import random
import pandas as pd
import os

cwd = os.getcwd()
cwd
os.chdir("/Users/etcetc")  # change directory
df = pd.read_csv('generatingstring.csv')
list1 = ['#','+','?','!','$','*','%','#','}','>','&','^']
list2 = df['String']  # creating a new column for the generated string
for row in df['Length']:  # hope this reads each row in that column
    for n in range(1, row):  # hope this reads the length value within cell
        s = random.choice(list1)
        list1.remove(s)  # to ensure random selection without replacement
        list2.append(s)
I was hoping to make it read each row within the Length column, and within each row take note of how many symbols to randomly select.
Thank you!
You can try:
import numpy as np
df.Word.map(lambda x: ''.join(np.random.choice(list1, len(x), replace=False)))
Out[145]:
0 &$!
1 >^$!
2 #}%?$>
3 #+!
Name: Word, dtype: object
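To get the third column the question asks for, you can assign the result back and rewrite the CSV; a short sketch reusing the file name from the question:

df['String'] = df.Word.map(lambda x: ''.join(np.random.choice(list1, len(x), replace=False)))
df.to_csv('generatingstring.csv', index=False)  # rewrite the file with the new column

Note that replace=False requires every word to be no longer than the twelve available symbols.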
Trying to read a .txt file into my Jupyter notebook.
This is my code:
acm = pd.read_csv('outputacm.txt', header=None, error_bad_lines=False)
print(acm)
Here is a sample of my txt file:
2244018
#*OQL[C++]: Extending C++ with an Object Query Capability.
##José A. Blakeley
#year1995
#confModern Database Systems
#citation14
#index0
#arnetid2
#*Transaction Management in Multidatabase Systems.
##Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#year1995
#confModern Database Systems
#citation22
#index1
#arnetid3
#*Overview of the ADDS System.
##Yuri Breitbart,Tom C. Reyes
#year1995
#confModern Database Systems
#citation-1
#index2
#arnetid4
And the different symbols are supposed to correspond to:
#* --- paperTitle
## --- Authors
#year ---- Year
#conf --- publication venue
#citation --- citation number (both -1 and 0 means none)
#index ---- index id of this paper
#arnetid ---- pid in arnetminer database
#% ---- the id of references of this paper (there are multiple lines, with each indicating a reference)
#! --- Abstract
Not sure how to set this up so the data gets read correctly. Ideally, I would want a data frame where each category is a different column, and then all the entries in the document are rows. Thanks!
My regex is not as up to speed as it should be, but the code below might work, as long as the data stays in the same form and the column names aren't duplicated in other lines:
import re
import pandas as pd

path = r"filepath.txt"
f = open(path, 'r')

year = []
confModern = []
# continue for all columns

for ele in f:
    if len(re.findall('year', ele)) > 0:
        year.append(ele[5:])
    if len(re.findall('confModern', ele)) > 0:
        confModern.append(ele[12:])
    # continue for all columns with the needed string

df = pd.DataFrame(data={'year': year})  # continue for each list
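A fuller sketch along the same lines, collecting one dict per paper so each category becomes a column; the marker-to-column mapping follows the legend above, the file name comes from the question, and the reference (#%) and abstract (#!) lines are simply skipped here:

import pandas as pd

markers = {'#*': 'paperTitle', '##': 'Authors', '#year': 'Year',
           '#conf': 'Venue', '#citation': 'Citation',
           '#index': 'Index', '#arnetid': 'ArnetId'}

records, current = [], {}
with open('outputacm.txt') as f:
    for line in f:
        line = line.strip()
        for marker, column in markers.items():
            if line.startswith(marker):
                if marker == '#*' and current:  # a '#*' line starts a new paper
                    records.append(current)
                    current = {}
                current[column] = line[len(marker):]
                break

if current:
    records.append(current)
df = pd.DataFrame(records)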
I have thousands of rows in a list like the one below that I would like to convert into a pandas table consisting of different columns.
2018-12-03 21:15:24 Sales:120 ID:534343 North America
2018-12-03 21:15:27 Sales:65 ID:534344 Europe
Ideally I would like to create a pandas structure with the following columns: Date, Sale, ID, Region, and then fill it with the matching values.
E.g. in the first row I have sales = 120, ID = 534343, region = North America and date = 2018-12-03 21:15:24.
Given that I have thousands of rows, what code could make this work?
Supposing your list is in a file, read it first into a string (or into a list, in which case the following code will differ) and then apply the parsing code.
To read into a string:
with open('/file/path/myfile.txt','r') as f:
    s = f.read()
Code for parsing:
import re
import pandas as pd

s = """2018-12-03 21:15:24 Sales:120 ID:534343 North America
2018-12-03 21:15:27 Sales:65 ID:534344 Europe"""

sales_re = "Sales:([0-9]+)"
id_re = "ID:([0-9]+)"

lst = []
for line in s.split('\n'):
    date = line[0:19]
    sale = re.search(sales_re, line).groups()[0]
    id = re.search(id_re, line).groups()[0]
    region = line[line.rfind(":")+1+len(id)+1:]  # Search from last ":", add one to go over ":" and 1 to skip the space
    x = [date, sale, id, region]
    lst.append(x)

df = pd.DataFrame(lst)
df.columns = ['date', 'sale', 'id', 'region']
In the example above, I assume everything is loaded into a string. Then I use regular expressions to extract the harder parts of each line and append everything to a list. Finally I use the pandas.DataFrame constructor to convert it into a DataFrame.
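A more compact variant of the same idea, using a single regex with named groups and pandas' str.extract (this assumes every line matches the sample format exactly):

import pandas as pd

lines = pd.Series(s.split('\n'))
pattern = (r'(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
           r'Sales:(?P<sale>\d+) ID:(?P<id>\d+) (?P<region>.+)')
df = lines.str.extract(pattern)  # columns: date, sale, id, region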
I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain values (such as Prize_Pool) results in pandas reading these entries as strings. I need to convert them to floats in order to make certain calculations. I've used Python's replace() function to get rid of the commas, but that's as far as I've gotten.
The column Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice of which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data down to unique timestamps.
Use the thousands=',' argument for numbers that contain a comma:
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check that Prize_Pool is now numeric:
In [3]: type(d.loc[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop duplicate rows, keeping the first observed (you can also keep the last):
In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
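If you want control over which repeated entry survives, one common trick is to sort before dropping duplicates; a sketch assuming, for example, that you want the highest Points per timestamp:

In [8]: d.sort_values('Points', ascending=False).drop_duplicates('Contest_Date_EST', keep='first')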
Edit: Just realized you're using pandas - should have looked at that. I'll leave this here for now in case it's applicable, but if it gets downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight.
Seems like itertools.groupby() is the tool for this job.
Something like this?
import csv
import itertools

class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from the CSV file
        rows = self.readCsv(filename)
        for key in rows.keys():
            print("\nKey: " + key)
            i = 1
            for value in rows[key]:
                print("\nValue {index} : {value}".format(index=i, value=value))
                i += 1

    def readCsv(self, fileName):
        with open(fileName, 'r', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may or may not be pulled in with extra space by DictReader()
            # The next line simply creates a small dict of stripped keys to original padded keys
            keys = {key.strip(): key for key in reader.fieldnames}
            # Note: groupby() only groups *consecutive* rows with equal keys,
            # so the file should be sorted by Contest_Date_EST for a full grouping
            groupedRows = {}
            for k, g in itertools.groupby(reader, lambda x: x["Contest_Date_EST"]):
                groupedRows[k] = [self.normalizeRow(list(v.values())) for v in g]
            return groupedRows

    def normalizeRow(self, row):
        row[9] = float(row[9].replace(',', ''))  # "Prize_Pool" is column index 9 in the header above
        # and so on
        return row

if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
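A rough pandas equivalent of the grouping above (my sketch, not the promised update):

import pandas as pd

d = pd.read_csv('./Test1.csv', thousands=',')  # thousands=',' handles the Prize_Pool commas
grouped = {date: group.values.tolist() for date, group in d.groupby('Contest_Date_EST')}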
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)