How to make read_csv more flexible with numbers and whitespace - Python

I want to read a .txt file with pandas, and the problem is that the separator/delimiter consists of a number followed by at least two blanks.
I already tried something similar to this code (How to make separator in pandas read_csv more flexible wrt whitespace?):
pd.read_csv("whitespace.txt", header=None, delimiter=r"\s+")
This only works when the separator is one or more blanks, so I adjusted it to the following:
delimiter=r"\d\s\s+"
But this splits my dataframe whenever it sees two or more blanks, whereas I strictly need a number followed by at least two blanks. Does anyone have an idea how to fix it?
My data looks as follows:
I am an example of a dataframe
I have Problems to get read
100,00
So How can I read it
20,00
So the first row should be:
I am an example of a dataframe I have Problems to get read 100,00
followed by the second row:
So How can I read it 20,00

I'd try it like this.
I'd manipulate the text file before I attempt to parse it into a dataframe, as follows:
import pandas as pd
import re

# Read the whole file and join all lines into one string
with open("whitespace.txt", "r") as f:
    g = f.read().replace("\n", " ")

# Append a '#' after every number of the form 123,45 so it can act as a row delimiter
prepared_text = re.sub(r'(\d+,\d+)', r'\1#', g)

df = pd.DataFrame({'My columns': prepared_text.split('#')})
print(df)
This gives the following:
My columns
0 I am an example of a dataframe I have Problems...
1 So How can I read it 20,00
2
I guess this would suffice as long as the input file isn't too large, but using the re module and substitution gives you the control you seek.
The (\d+,\d+) parentheses mark a group that we want to match; we're basically matching any of the numbers in your text file.
Then \1, which is called a backreference, refers to the matched group in the replacement, so each \d+,\d+ is replaced by \d+,\d+#.
Then we use the inserted character as a delimiter.
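For example, applying the same substitution to a small string makes the mechanics visible (a minimal illustration of the approach above):
import re

text = "I have Problems to get read 100,00  So How can I read it 20,00"
marked = re.sub(r'(\d+,\d+)', r'\1#', text)
print(marked)             # each number is now followed by a '#'
print(marked.split('#'))  # splitting on '#' yields the intended rows (plus a trailing empty string)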
There are some good examples here:
https://lzone.de/examples/Python%20re.sub

Related

Extracting tables from PDF using tabula-py fails to properly detect rows

Problem
I want to extract a 70-page vocabulary table from a PDF and turn it into a CSV to use in [any vocabulary learning app].
tabula-py and its read_pdf function are a popular solution for extracting the tables, and it detected the columns well without any fine-tuning. However, it struggled with the multi-line rows, splitting each line into a separate row.
E.g., in the PDF you would only have columns 2 and 3. The table below doesn't allow multi-line content either, so I added row numbers; just merge the rows numbered 1 in your head.
Row number | German                      | Latin
1          | First word                  | Translation for first word
1          | with many lines of content  | [phonetic vocabulary thingy]
1          | and more lines              |
2          | Second word                 | Translation for second word
Instead of fine-tuning the read_pdf parameters, are there ways around that?
You may want to use PyMuPDF. As your table cells are bounded by ruling lines, this is a relatively easy case.
I have published a script to answer a similar question here.
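For reference, a rough sketch of that approach using PyMuPDF's built-in table detection (available from PyMuPDF 1.23 onwards; file names here are placeholders):
import fitz  # PyMuPDF
import pandas as pd

doc = fitz.open("vocabulary.pdf")
frames = []
for page in doc:
    # find_tables() detects cells bounded by ruling lines, which matches this kind of table
    for table in page.find_tables().tables:
        frames.append(table.to_pandas())

df = pd.concat(frames, ignore_index=True)
df.to_csv("vocabulary.csv", index=False, sep=";")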
Possible solution
Instead of experimenting with tabula-py, which is perfectly legitimate of course, you can export the PDF from Adobe Reader via File -> Export a PDF -> HTML Web Page.
You then read it using
import pandas as pd
dfs = pd.read_html("file.html", header=0, encoding='utf-8')
to get a list of pandas dataframes. You could also use BeautifulSoup4 or similar solutions to extract the tables.
To match tables with the same column names (e.g., in a vocabulary table) and save them as csv, you can do this:
from collections import defaultdict

unique_columns_to_dataframes = defaultdict(list)

# We need a hashable key for the dictionary, so we join df.columns.values; strings can be hashed.
possible_column_variations = [("%%".join(list(df.columns.values)), i) for i, df in enumerate(dfs)]
for k, v in possible_column_variations:
    unique_columns_to_dataframes[k].append(v)

for k, v in unique_columns_to_dataframes.items():
    new_df = pd.concat([dfs[i] for i in v])
    new_df.reset_index(drop=True, inplace=True)
    # Save each file with a unique name derived from the characters of the column names;
    # not collision-free, but collisions are unlikely for a small number of tables.
    new_df.to_csv("Df_" + str(sum([ord(c) for c in k])) + ".csv", index=False, sep=";", encoding='utf-8')

Extraction of complete rows from CSV using a list, when we don't know the row indices

Can somebody help me in solving the below problem?
I have a CSV which is relatively large, with over 1 million rows x 4000 columns. Case ID is one of the first column headers in the CSV. Now I need to extract the complete rows belonging to a few case IDs, which are documented in a list as faulty IDs.
Note: I don't know the indices of the required case IDs.
Example: the CSV is production_data.csv and the faulty IDs are faulty_Id = [50055, 72525, 82998, 1555558].
Now, we need to extract the complete rows for faulty_Id = [50055, 72525, 82998, 1555558].
Best Regards
If your faulty ID column is present as a header in the CSV file, you can use a pandas dataframe: read the file with read_csv, set the index to that column, and extract rows based on the faulty IDs. For more info, attach sample data from the CSV.
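A minimal sketch of the pandas approach (using isin rather than an index, and reading in chunks to keep memory bounded; "Case_ID" is a placeholder for the real column header):
import pandas as pd

faulty_Id = [50055, 72525, 82998, 1555558]

# Keep only the rows whose Case ID appears in the faulty list; "Case_ID" is assumed.
matches = []
for chunk in pd.read_csv("production_data.csv", chunksize=100_000):
    matches.append(chunk[chunk["Case_ID"].isin(faulty_Id)])

faulty_rows = pd.concat(matches, ignore_index=True)
faulty_rows.to_csv("faulty_rows.csv", index=False)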
"CSV, which is relatively large with over 1 million rows X 4000 columns"
As CSVs are just text files and this one is probably too big to load as a whole, I suggest using the built-in fileinput module. If the ID is the first column, create extractfaults.py as follows:
import fileinput

faulty = ["50055", "72525", "82998", "1555558"]

for line in fileinput.input():
    # keep the header line (fileinput.lineno() is 1-based)
    if fileinput.lineno() == 1:
        print(line, end='')
    # keep lines whose first column is one of the faulty IDs
    elif line.split(",", 1)[0] in faulty:
        print(line, end='')
and use it in the following way:
python extractfaults.py data.csv > faultdata.csv
Explanation: keep lines which are either the first line (the header) or have one of the provided IDs (I used the optional second argument of .split to limit the number of splits to 1). Note the usage of end='', as fileinput keeps the original newlines. My solution assumes that the IDs are not quoted and that the ID is the first column; if either of these does not hold, feel free to adjust my code to your purposes.
The best way for you is to use a database like Postgres or MySQL. You can copy your data into the database first and then easily operate on rows and columns. A flat file is not the best solution in your case, since you would need to load all of the data from the file into memory to be able to process it, and opening such a large file takes a lot of time on top of that.
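A rough sketch of that workflow, using SQLite from the standard library purely as a stand-in (for the full 4000-column table a real Postgres/MySQL server would be needed, since SQLite's default column limit is lower; table and column names are assumptions):
import sqlite3
import pandas as pd

conn = sqlite3.connect("production_data.db")

# Copy the CSV into the database in chunks so it never has to fit in memory at once.
for chunk in pd.read_csv("production_data.csv", chunksize=100_000):
    chunk.to_sql("production", conn, if_exists="append", index=False)

# Extracting the faulty rows then becomes a single SQL query.
faulty_Id = [50055, 72525, 82998, 1555558]
placeholders = ",".join("?" * len(faulty_Id))
faulty_rows = pd.read_sql_query(
    "SELECT * FROM production WHERE Case_ID IN (%s)" % placeholders,
    conn, params=faulty_Id)
conn.close()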

Any idea how to import this data set?

I have the following dataset:
https://github.com/felipe0216/fdfds/blob/master/sheffield_weather_station.csv
I know it is a CSV, so I can use the pd.read_csv() function. However, if you look into the file, it is neither comma-separated nor tab-separated, and I am not sure what separator it uses exactly. Any ideas about how to open it?
The proper way to do this is as follows:
df = pd.read_csv("sheffield_weather_station.csv", comment="#", delim_whitespace=True)
You should first download the .csv file. You have to tell pandas that there will be comment lines and that only spaces separate the columns. The number of spaces does not matter.
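For what it's worth, a variant that avoids delim_whitespace (which newer pandas versions deprecate in favour of a regex separator); it assumes you saved the raw file locally under the same name:
import pandas as pd

# comment="#" skips the commented header lines; sep=r"\s+" splits on any run of whitespace
df = pd.read_csv("sheffield_weather_station.csv", comment="#", sep=r"\s+")
print(df.head())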

How to merge columns with no header name in a python script?

My Python script parses some text from an Excel file: it strips whitespace and changes the delimiters
(from " : " --> " , ")
and outputs to a CSV file. Much of the data looks like this
[screenshot: what the data looks like in Excel]
Separated by a single column due to there being an extra comma or two.
CSV == comma-separated values.
I have tried using if statements to add or subtract commas to try to shore it up, but it ends up completely messing up the relative order it was first in. Driving me nuts!
To try to do it another way, I installed the pandas library (a data manipulation library) using pip.
Is it possible to merge columns that have no column headers inside a single DataFrame? There's plenty of advice regarding separate DataFrames, but not much for a single one.
Furthermore, how can I merge the columns while retaining the row position? The emails are in the correct row position but not the column position.
Or am I on the wrong track completely; is pandas overkill for a simple parsing script? I've been learning Python as I go along to try to complete the script, so I might have missed a simple way of doing it.
Some sample data:
C5XXEmployeeNumXX,C5XXEmployeeNumXX,JohnSmith,1,,John,,Smith,,IT Supp.Centre,EU,,London1,,,59XXXX,ITServiceDesk,LOND01,,,,Notmaintained,,,,,,,,john.smith#company.com,
Snippet of parsing logic
for line in f:
    # finds the identifier for users
    if ':LON ' in line:
        # parsing logic:
        # delimiters are swapped and whitespace is scrubbed
        line = line.replace(':', ',')
        line = line.replace(' ', '')
You can use a separator/delimiter of your choice. Check out: https://docs.python.org/2/library/csv.html#csv.Dialect.delimiter
Also, regarding the order: if you read each row into a list it should be fine, but if you read the contents of a row into a dict then it is normal that the order is not preserved.
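As a rough sketch of that idea, reading each row back as a list keeps the original order, and the empty cells left behind by stray commas can simply be dropped (file names here are hypothetical):
import csv

with open("parsed_output.csv", newline="") as src, open("merged_output.csv", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter=",")
    for row in csv.reader(src, delimiter=","):
        # csv.reader preserves field order; drop cells that are empty due to extra commas
        writer.writerow([cell for cell in row if cell.strip()])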

Pandas appending .0 to a number

I'm having an issue with pandas that I'm a little baffled by. I have a file with a lot of numeric values that do not need calculations. Most of them come out just fine, but a couple of them get ".0" appended to the end.
Here is a sample input file:
Id1 Id2 Age Id3
"SN19602","1013743", "24", "23523"
"SN20077","2567897", "28", "24687"
And the output being generated:
Id1 Id2 Age Id3
"SN19602","1013743.0", "24", "23523"
"SN20077","2567897.0", "28", "24687"
Can anyone explain why some but not all of the numeric values are getting the .0 appended, and whether there is any way I can prevent it? It is a problem when I perform the next step of my process with the CSV output.
I have tried converting the data frame and the column itself to a string, but it did not make an impact. Ideally I do not want to list each column to convert, because I have a very large number of columns and would manually have to go through the output file to figure out which ones are getting the .0 appended and code for them. Any suggestions appreciated.
import pandas as pd
import csv
df_inputFile = pd.read_csv("InputFile.csv")
df_mappingFile = pd.read_csv("MappingFile.csv")
df_merged = df_inputFile.merge(df_mappingFile, left_on="Id", right_on="Id", how="left")
#This isn't affecting the output
df_merged.astype(str)
df_merged.to_csv("Output.csv", index=False, quoting=csv.QUOTE_ALL)
pandas.DataFrame.to_csv has a parameter float_format, which takes a regular float formatting string. This should work:
df_merged.to_csv("Output.csv", index=False, quoting=csv.QUOTE_ALL, float_format='%.0f')
Pandas may be treating the datatype of that column as float; that is the reason you are getting .0 appended to the data. You can use
dtype=object in pd.read_csv:
df_inputFile = pd.read_csv("InputFile.csv", dtype=object)
This will make pandas read and consider all columns as string.
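A small demonstration of why this happens and what dtype=object changes; the inline data mimics the sample above, with one empty Id2 cell (a missing value is one common way a column of integers ends up as float):
import pandas as pd
from io import StringIO

csv_text = 'Id1,Id2\n"SN19602","1013743"\n"SN20077",\n'

# The empty Id2 cell becomes NaN, which forces the whole column to float64 -> "1013743.0" on output
print(pd.read_csv(StringIO(csv_text)).dtypes)

# With dtype=object every column is kept as strings, so no ".0" is appended
print(pd.read_csv(StringIO(csv_text), dtype=object).dtypes)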
I like loops. They are slow, but easy to understand.
This is elegant for the logic, and it also allows different formatting/decimals for each column.
Something like:
final_out = open("Output.txt", 'w')
for index, row in df.iterrows():
    # format each value without decimals; 'A', 'B', 'C' stand in for your column names
    print('{:.0f}'.format(row['A']), '{:.0f}'.format(row['B']), '{:.0f}'.format(row['C']), sep=",", file=final_out)
final_out.close()
I think the best/fastest way to do this is with something like tabulate or pretty printer.
First convert your dataframe to an array, this is easy.
array = df.values
Then you can use something neat like tabulate.
from tabulate import tabulate as tb

final_out = open("Output.txt", 'w')
print(tb(array, numalign="right", floatfmt=".0f"), file=final_out)
final_out.close()
You can read up a little more on tabulate or pretty printer; the above is a contextual example to get you started.
Similar to the loop above, tabulate allows a separator, which could be a comma.
https://pypi.python.org/pypi/tabulate at "Usage of the command line utility".
Pretty sure pretty printer can do this too, and it could very well be a better choice.
Both of these use the print function. If you use Python 2.7 you will need this nifty little statement as your first non-comment line in your script:
from __future__ import print_function
I have recently faced this issue. In my case, the column similar to the Id2 column in the question had an empty cell that pandas interpreted as NaN, and all the other cells of that column then had a trailing .0.
Reading the file with keep_default_na=False helps to avoid those trailing .0.
my_df = pd.read_csv("data.csv", keep_default_na=False)
P.S.: I know this answer is rather late, but this worked for me without enforcing data types while reading the data or having to use float_format.
