My Python script parses some text from an Excel file. It strips whitespace and changes the delimiters
(from " : " --> " , ")
and outputs to a CSV file. Much of the data ends up looking like this
(an image of what the data looks like in Excel was attached to the original post),
separated by a single column due to an extra comma or two.
CSV == Comma separated values.
I have tried using if statements to add or remove commas to shore it up, but that completely messes up the relative order the data was in. It's driving me nuts!
To try another way, I installed the pandas library (a data-manipulation library) using pip.
Is it possible to merge columns that have no column headers inside a single DataFrame? There's plenty of advice about merging separate DataFrames, but not much about a single one.
Furthermore, how can I merge the columns while retaining the row position? The emails are in the correct row position but not the correct column position.
Or am I on the wrong track completely? Is pandas overkill for a simple parsing script? I've been learning Python as I go along, so I might have missed a simple way of doing it.
Some sample data:
C5XXEmployeeNumXX,C5XXEmployeeNumXX,JohnSmith,1,,John,,Smith,,IT Supp.Centre,EU,,London1,,,59XXXX,ITServiceDesk,LOND01,,,,Notmaintained,,,,,,,,john.smith#company.com,
Snippet of the parsing logic:
for line in f:
    # finds the identifier for users
    if ':LON ' in line:
        # parsing logic: delimiters are swapped, whitespace is scrubbed
        line = line.replace(':', ',')
        line = line.replace(' ', '')
You can use a separator/delimiter of your choice with the csv module. Check out: https://docs.python.org/2/library/csv.html#csv.Dialect.delimiter.
Also, regarding the order: if you are reading each row into a list, the order should be fine, but if you are reading the contents of a row into a dict, it is normal for the order not to be preserved.
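As for merging headerless columns inside a single DataFrame while keeping row order: here is a minimal sketch, assuming the stray values are spread across a known block of columns with at most one value per row (the file name and the column positions are hypothetical):

import pandas as pd

# Read the CSV without headers; empty cells become NaN.
df = pd.read_csv('output.csv', header=None)

# Collapse columns 4-8 (hypothetical positions) into a single column:
# back-filling across each row moves the first non-empty value left.
block = df.loc[:, 4:8].bfill(axis=1)
df['merged'] = block.iloc[:, 0]

Because the operation is purely row-wise, every row keeps its original position.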
I'm developing an API which should, ideally, export a comma-separated list as a .txt file that looks like
alphanumeric1, alphanumeric2, alphanumeric3
The data to be exported comes from a column of a pandas DataFrame, so I can presumably get at it easily, but all my attempts to get it as a single-line string haven't worked. Instead, the text file I receive is
,ColumnHeader
0,alphanumeric1
0,alphanumeric2
0,alphanumeric3
I've tried using string literals with backticks, writing to multiple lines, and appending commas to each value in the list, but it all comes out in the form of a CSV, which won't work for my purposes.
How would y'all achieve this effect?
I am not sure, but perhaps what you need is:
csvList = ','.join(df.ColumnHeader)
where df is, of course, your pandas DataFrame.
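To then get a single-line .txt file rather than a CSV, skip to_csv entirely and write the joined string yourself. A minimal follow-up sketch (the output file name is hypothetical; ', ' with a space matches the desired output):

# Join the column into one string and write it as plain text.
csvList = ', '.join(df['ColumnHeader'].astype(str))
with open('export.txt', 'w') as f:
    f.write(csvList)

It is DataFrame.to_csv that was adding the index column and header seen in the question's output.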
Can somebody help me solve the problem below?
I have a CSV which is relatively large, with over 1 million rows x 4000 columns. Case ID is one of the first column headers in the CSV. I need to extract the complete rows belonging to a few case IDs, which are documented in a list as faulty IDs.
Note: I don't know the indices of the required case IDs.
Example: the CSV is production_data.csv and the faulty IDs are faulty_Id = [50055, 72525, 82998, 1555558].
Now, we need to extract the complete rows for those faulty IDs.
Best Regards
If your faulty ID column is present as a header in the CSV file, you can use a pandas DataFrame: read the file with read_csv, set the index to the ID column, and extract the rows matching the faulty IDs. For more specific help, attach sample data from the CSV.
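A minimal sketch of that approach, reading in chunks so the 1-million-row file never has to fit in memory at once (the column name 'Case ID' is an assumption from the question; adjust it to the actual header):

import pandas as pd

faulty_ids = [50055, 72525, 82998, 1555558]
matches = []

# Stream the file in manageable chunks instead of loading all rows at once.
for chunk in pd.read_csv('production_data.csv', chunksize=100_000):
    matches.append(chunk[chunk['Case ID'].isin(faulty_ids)])

pd.concat(matches).to_csv('faulty_rows.csv', index=False)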
CSV, which is relatively large with over 1 million rows X 4000 columns
As CSVs are just text files, and this one is probably too big to load as a whole, I suggest using the built-in fileinput module. If the ID is the first column, create extractfaults.py as follows:
import fileinput

faulty = ["50055", "72525", "82998", "1555558"]

for line in fileinput.input():
    if fileinput.lineno() == 1:  # lineno() is 1-based; keep the header line
        print(line, end='')
    elif line.split(",", 1)[0] in faulty:
        print(line, end='')
and use it the following way:
python extractfaults.py data.csv > faultdata.csv
Explanation: keep lines which are either the first line (the header) or have one of the provided IDs (I used the optional second .split argument to limit the number of splits to 1). Note the use of end='', as fileinput keeps the original newlines. My solution assumes that the IDs are not quoted and the ID is the first column; if either of these does not hold, feel free to adjust the code to your purposes.
The best way for you is to use a database like Postgres or MySQL. You can copy your data into the database first and then easily operate on rows and columns. A flat file is not the best solution in your case, since you would need to load all the data from the file into memory to process it, and opening the file each time takes a lot of time as well.
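A minimal sketch of that workflow, using SQLite from the standard library as a lightweight stand-in for Postgres/MySQL. Note that SQLite's default limit of 2000 columns makes it unsuitable for this exact 4000-column file, so this only illustrates the load-once, query-many pattern; the table name and the 'Case ID' column are assumptions:

import sqlite3
import pandas as pd

conn = sqlite3.connect('production.db')

# Load once, in chunks, so the whole CSV never sits in memory.
for chunk in pd.read_csv('production_data.csv', chunksize=100_000):
    chunk.to_sql('production', conn, if_exists='append', index=False)

# Afterwards, extracting rows is a simple query.
faulty = [50055, 72525, 82998, 1555558]
placeholders = ','.join('?' * len(faulty))
rows = conn.execute(
    'SELECT * FROM production WHERE "Case ID" IN (%s)' % placeholders,
    faulty,
).fetchall()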
I'm writing a script that organises data for a neural network project, specifically a sentence and the label I assign to it. The part of my script that outputs the data (which I had temporarily stored in a list) as a .csv file looks like this:
with open(out_file, 'w+') as out:
    out.write("sentence, label \n")  # Write a header for the .csv file
    for item in corp_list:
        out.write(item + '\n')  # Item is intended to look like: '[sentence], [label]'
As above, each item in corp_list is intended to be formatted like this example:
I like to go to the mountains, L
Where 'L' is the label I assign it.
Most of my data, when I load it using pd.read_csv, looks perfect, with the newline separating each entry as intended. But there are about 11,000 entries that look something like this:
He is my brother, E\nWe can't wait to go to on holidays, N\nMy father was a painter, T\nShe hates the sea, E
It starts 'merging' entries into one big entry, which renders my dataset very hard to use. I'm really unsure why most newlines work but some of these don't. How I format the data and write it to the file never changes across all of my 16 million entries.
Any advice on whether this is likely a newline/code issue or perhaps something within my own dataset?
Edit:
Just to note: my sentences themselves contain no commas.
This problem does not happen when I write the same list to a normal .txt file. It only happens when I write it and then read it back as a CSV, either through a pandas DataFrame or through the csv module's reader method.
Also, when I output my list to a .txt file and then load that back in sentence by sentence into a list, versus loading the CSV into a DataFrame, the incorrect entries are slightly changed. All incorrect entries are missing the space at the comma. For example, here is what a correct entry looks like:
I like to go to the mountains, L
compared to an incorrect entry (which, as mentioned, has a lot more concatenated onto it):
I like to go to the mountains,L
And only the last label from any long, incorrect string entry is set as the label.
Try this, writing the two strings separately:
with open(out_file, 'w+') as out:
    out.write("sentence, label \n")  # Write a header for the .csv file
    for item in corp_list:
        out.write(item)
        out.write('\n')  # Item is intended to look like: '[sentence], [label]'
or try using f-strings:
with open(out_file, 'w+') as out:
    out.write("sentence, label \n")  # Write a header for the .csv file
    for item in corp_list:
        out.write(f'{item}\n')  # Item is intended to look like: '[sentence], [label]'
It is not recommended to use '+' string concatenation at this scale. Maybe there are implicit string concatenations hidden in your data that are responsible for the merges. If this doesn't help, the problem most likely originates in your data.
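A further sketch, assuming each item in corp_list can be split into a sentence and a label on its last comma: letting the csv module do the writing makes quoting and line endings explicit, so a stray literal newline inside a sentence gets quoted instead of merging entries.

import csv

with open(out_file, 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['sentence', 'label'])
    for item in corp_list:
        # Hypothetical split: text before the last comma is the sentence,
        # text after it is the label.
        sentence, label = item.rsplit(',', 1)
        writer.writerow([sentence.strip(), label.strip()])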
I am trying to load a large text file into a pandas DataFrame. One thing I noticed is that, to load it successfully, I have to drop all the bad lines. But I would like to load all rows first, take a look, and then clean the data manually. Is there a way to do that?
data = pd.read_csv('filename.txt', sep="\t", error_bad_lines=False, engine='python')
Here are the warnings I got. It's a common error, but all the solutions I've found just skip these lines, and I really need to load all rows... any thoughts?
Skipping XXX line: Expected 28 fields in line XXX, saw 29
Without knowing more about the specific CSV file, it looks like there is either:
Too many columns in that row (an extra delimiter)
Quoting is off, meaning there's a delimiter that should be inside quotes but isn't
The best way to remedy this is to fix the problem in the CSV file.
Technically you're not just loading the file, but also parsing it at the same time. It looks like you've handled the delimiter properly, so, as you may have guessed, you have too many or too few columns in some of your rows. That may actually be the case, or perhaps you have tabs within text fields that are being interpreted as delimiters.
In any case, pandas isn't going to parse those inconsistent lines.
A typical approach is to open the file in a robust text editor and look at the lines that are erroring out in pandas. See what's actually wrong and either fix it in the text editor, or use Python's built-in open() function to load the entire file and iterate line by line, with logic that fixes whatever the problem is.
Once you're certain that you have the same number of columns in every row, load it with pandas.
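A minimal sketch of that line-by-line approach, using the tab delimiter and the field count from the warning (the file name comes from the question; the rest is hypothetical):

import pandas as pd

expected_fields = 28  # from the warning: "Expected 28 fields"
good, bad = [], []

with open('filename.txt') as f:
    for lineno, line in enumerate(f, start=1):
        row = line.rstrip('\n').split('\t')
        if len(row) == expected_fields:
            good.append(row)
        else:
            bad.append((lineno, line))  # inspect and repair these by hand

# Load the well-formed rows; the first good row is assumed to be the header.
df = pd.DataFrame(good[1:], columns=good[0])
print(f'{len(bad)} malformed lines need manual attention')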
I have 100 CSV files, each containing publication data for a different institution, and I would like to perform the same manipulation on all of them:
1. Get the institution name from cell B1. This is always after 'at' or 'at the'. For example, 'Publications at Tohoku University'.
2. Vlookup the matching InstitutionCode from another CSV file called 'Codes'. For example, '1286' (for Tohoku University).
3. Delete rows 1-14 (including the institution name in cell B1).
4. Insert two extra columns (columns A and B) into the file with the following headers, 'Institution' and 'InstitutionCode', and fill them with the relevant information for all rows where I have data (in the above example, Tohoku University and 1286).
I am new to Python and find it hard to put this script together from the resources I have found.
Can anyone please help me?
(Images of the data in its original format and of the required result were attached to the original post.)
I could give you the code, but instead I'll explain how you can write it yourself.
Read the Codes file and store the institutions and codes in a dictionary.
You can read more about reading csv files here: https://pymotw.com/2/csv/ or here: https://pymotw.com/3/csv/.
Each row will be represented as a list of strings, so you can access cell elements by their index. Make the Institution names the keys and the codes the values.
Read the csv files one by one in a for loop. I'll call these the input files. Open a new file for writing for each input file that you read. I'll call these the output files.
Loop over the rows in the csv file. You can keep track of the row numbers by using enumerate. You can find info on this here for example: http://book.pythontips.com/en/latest/enumerate.html.
Get the contents of cell B1 by taking element 1 from row 0.
Find the Institution name by using a regular expression. More info here for example: http://dev.tutorialspoint.com/python/python_reg_expressions.htm
And get the Institution code from the dictionary you made in step 1.
Keep looping over the rows until the first element equals 'Title'. This row contains the headers. Write "Institution" and "InstitutionCode" to the output file, followed by the headers you just found. To do this, convert your row (a list of strings) to a tuple (http://www.tutorialspoint.com/python/python_tuples.htm) and give that as an argument to the writerow method of the csv writer object (see the links in step 1).
Then for each row after the header row, make a tuple of the Institution name and code, followed by the information from the row from the input file you just read, and give that as an argument to the writerow method of the csv writer object.
Close the output file.
One thing to think about is whether you want quotes around the cell contents in the output files. You can read about this in the links in step 1. The same goes for the field delimiters. If you don't specify anything, they are assumed to be commas, but you can change this.
I hope this helps!
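If you want something concrete to check your own version against, a minimal sketch of these steps might look like the following. The layout of the Codes file (name in the first column, code in the second), the input file pattern, and the 'Title' header row are all assumptions taken from the question:

import csv
import glob
import re

# Step 1: build an institution-name -> code dictionary from the Codes file.
codes = {}
with open('Codes.csv', newline='') as f:
    for row in csv.reader(f):
        codes[row[0]] = row[1]

# Step 2: process each input file and write a matching output file.
for path in glob.glob('publications_*.csv'):  # hypothetical file pattern
    with open(path, newline='') as fin, \
         open(path.replace('.csv', '_out.csv'), 'w', newline='') as fout:
        rows = list(csv.reader(fin))
        writer = csv.writer(fout)

        # Cell B1 is row 0, element 1; the name follows 'at' or 'at the'.
        name = re.search(r'\bat (?:the )?(.+)', rows[0][1]).group(1)
        code = codes[name]

        in_data = False
        for row in rows:
            if not in_data:
                # Skip the preamble until the header row starting with 'Title'.
                if row and row[0] == 'Title':
                    writer.writerow(['Institution', 'InstitutionCode'] + row)
                    in_data = True
            else:
                writer.writerow([name, code] + row)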