I have a text file with only one column containing text content. I want to find the top 3 most frequent items and the 3 least frequent items. I have tried some of the solutions in other posts, but I am not able to get what I want. I tried finding the mode as shown below, but it just outputs all the rows. I also tried using Counter and the most_common function, but they do the same thing, i.e. they print all the rows in the file. Any help is appreciated.
# My Code
import pandas as pd
df = pd.read_csv('sample.txt')
print(df.mode())
You can use Python's built-in Counter from the collections module.
from collections import Counter

# Read the file directly into a Counter (one count per distinct line)
with open('file') as f:
    cnts = Counter(l.strip() for l in f)

# Display the 3 most common lines
print(cnts.most_common(3))

# Display the 3 least common lines
print(cnts.most_common()[-3:])
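If you want to stay in pandas, as in your own attempt, Series.value_counts gives the same information. Here is a minimal sketch, assuming the file has no header row (hence header=None) and using 'item' as a made-up column name:

import pandas as pd

# header=None so the first line is treated as data, not as a column name
df = pd.read_csv('sample.txt', header=None, names=['item'])

counts = df['item'].value_counts()   # sorted from most to least frequent
print(counts.head(3))                # the 3 most frequent items
print(counts.tail(3))                # the 3 least frequent items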
Problem
I want to extract a 70-page vocabulary table from a PDF and turn it into a CSV to use in [any vocabulary learning app].
Tabula-py and its read_pdf function are a popular solution for extracting the tables, and it detected the columns well without any fine-tuning. However, it had difficulties with the multi-line rows, splitting each wrapped line into a separate row.
E.g., in the PDF you will have columns 2 and 3. The table markup on Stack Overflow doesn't seem to allow multi-line content either, so I added row numbers. Just merge the rows numbered 1 in your head.
Row number | German                     | Latin
-----------|----------------------------|------------------------------
1          | First word                 | Translation for first word
1          | with many lines of content | [phonetic vocabulary thingy]
1          | and more lines             |
2          | Second word                | Translation for second word
Instead of fine-tuning the read_pdf parameters, are there ways around that?
You may want to use PyMuPDF. Since your table cells are enclosed by ruled lines, this is a relatively easy case.
I have published a script to answer a similar question here.
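For reference, a rough sketch of what a PyMuPDF-based extraction could look like; this assumes a recent PyMuPDF release that provides Page.find_tables(), and vocab.pdf is a placeholder file name:

import fitz  # PyMuPDF

rows = []
with fitz.open("vocab.pdf") as doc:
    for page in doc:
        # find_tables() locates tables from the ruled cell borders on the page
        for table in page.find_tables().tables:
            # extract() returns the table as a list of rows (lists of cell strings);
            # text that wraps over several lines stays inside its cell
            rows.extend(table.extract())

print(rows[:5])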
Possible solution
Instead of experimenting with tabula-py, which is perfectly legitimate of course, you can export the PDF from Adobe Reader using File->Export a PDF->HTML Web Page.
You then read it using
import pandas as pd
dfs = pd.read_html("file.html", header=0, encoding='utf-8')
to get a list of pandas dataframes. You could also use BeautifulSoup4 or similar solutions to extract the tables.
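For completeness, a rough sketch of the BeautifulSoup route, assuming the exported file is called file.html as above:

from bs4 import BeautifulSoup

with open("file.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

tables = []
for table in soup.find_all("table"):
    # each table becomes a list of rows, each row a list of cell strings
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]
    tables.append(rows)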
To match tables with the same column names (e.g., in a vocabulary table) and save them as csv, you can do this:
from collections import defaultdict

unique_columns_to_dataframes = defaultdict(list)

# We need a hashable key for the dictionary, so we join df.columns.values into a string. Strings can be hashed.
possible_column_variations = [("%%".join(list(df.columns.values)), i) for i, df in enumerate(dfs)]
for k, v in possible_column_variations:
    unique_columns_to_dataframes[k].append(v)

for k, v in unique_columns_to_dataframes.items():
    new_df = pd.concat([dfs[i] for i in v])
    new_df.reset_index(drop=True, inplace=True)
    # Save each file with a unique name. The name is a hash of the characters in the column
    # names: not collision-free, but unlikely to collide for a small number of tables.
    new_df.to_csv("Df_" + str(sum([ord(c) for c in k])) + ".csv", index=False, sep=";", encoding='utf-8')
I am relatively new to Python and I was wondering if it is possible to use Python to count specific recurring elements in something like a text file. For example, if the file had:
ID1002
ID1002
ID1001
ID1003
ID1003
ID1003
Would it be possible to count how many times each ID appears and store that information somewhere? Thank you for the help.
You could use the readlines() function to get a list of all the IDs and then use the list's .count() method.
counts = {}
with open("yourfile.txt") as f:
    ids = [line.strip() for line in f.readlines()]
for i in ids:
    counts[i] = ids.count(i)
print(counts)
This code reads all the IDs from the file into a list once, then goes through that list and saves each ID and the number of times it occurs to a dictionary. (The list is stored first because readlines() can only be consumed once per open file object.)
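Alternatively, collections.Counter (used in the first answer above) does the same counting in a single pass, which can be more convenient for larger files; a minimal sketch:

from collections import Counter

with open("yourfile.txt") as f:
    counts = Counter(line.strip() for line in f)

print(counts)   # e.g. Counter({'ID1003': 3, 'ID1002': 2, 'ID1001': 1})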
Can somebody help me solve the problem below?
I have a CSV which is relatively large, with over 1 million rows x 4000 columns. Case ID is one of the first column headers in the CSV. Now I need to extract the complete rows belonging to a few case IDs, which are documented in a list as faulty IDs.
Note: I don't know the indices of the required case IDs.
Example: the CSV is production_data.csv and the faulty IDs are faulty_Id = [50055, 72525, 82998, 1555558].
Now, we need to extract the complete rows for faulty_Id = [50055, 72525, 82998, 1555558].
Best Regards
If your faulty ID column is present as a header in the CSV file, you can use a pandas DataFrame: read the file with read_csv, set the faulty ID column as the index, and extract rows based on that index. For more info, attach sample data from the CSV.
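A minimal sketch of that idea, reading the file in chunks so the full 1 million x 4000 table never has to sit in memory at once; the column name "Case ID" and the chunk size are assumptions, and a boolean mask with isin is used instead of an index lookup for simplicity:

import pandas as pd

faulty_ids = [50055, 72525, 82998, 1555558]

# Read the big CSV in chunks and keep only the rows whose case ID is in the faulty list.
# "Case ID" is an assumed header name; replace it with the real one.
chunks = pd.read_csv("production_data.csv", chunksize=100_000)
faulty_rows = pd.concat(chunk[chunk["Case ID"].isin(faulty_ids)] for chunk in chunks)

faulty_rows.to_csv("faultdata.csv", index=False)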
"CSV, which is relatively large with over 1 million rows X 4000 columns"
As CSVs are just text files and this one is probably too big to load as a whole, I suggest using the built-in fileinput module. If the ID is the 1st column, create extractfaults.py as follows:
import fileinput

faulty = ["50055", "72525", "82998", "1555558"]

for line in fileinput.input():
    if fileinput.lineno() == 1:
        # always keep the header line
        print(line, end='')
    elif line.split(",", 1)[0] in faulty:
        print(line, end='')
and use it the following way:
python extractfaults.py data.csv > faultdata.csv
Explanation: keep lines which are either the 1st line (the header) or whose first field is one of the provided IDs (I used the optional 2nd argument of .split to limit the number of splits to 1). Note the use of end='', as fileinput keeps the original newlines. My solution assumes that the IDs are not quoted and that the ID is the first column; if either of these does not hold, feel free to adjust the code to your purposes.
The best way for you is to use a database like Postgres or MySQL. You can copy your data into the database first and then easily operate on rows and columns. A flat file is not the best solution in your case, since you would need to load all the data from the file into memory to process it, and opening such a large file takes a lot of time as well.
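A rough sketch of that approach using SQLite through pandas; the table name, the "Case ID" column name, and the chunk size are assumptions, and the CSV is still copied over in chunks rather than loaded all at once:

import sqlite3
import pandas as pd

con = sqlite3.connect("production_data.db")

# Copy the CSV into the database chunk by chunk
for chunk in pd.read_csv("production_data.csv", chunksize=100_000):
    chunk.to_sql("production_data", con, if_exists="append", index=False)

# The faulty rows are then a simple query away
faulty_ids = (50055, 72525, 82998, 1555558)
query = 'SELECT * FROM production_data WHERE "Case ID" IN (?, ?, ?, ?)'
faulty_rows = pd.read_sql_query(query, con, params=faulty_ids)
faulty_rows.to_csv("faultdata.csv", index=False)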
I want to read a .txt file with pandas, and the problem is that the separator/delimiter consists of a number followed by a minimum of two blanks.
I already tried something similar to this code (How to make separator in pandas read_csv more flexible wrt whitespace?):
pd.read_csv("whitespace.txt", header=None, delimiter=r"\s+")
This only works if there is one blank or more. So I adjusted it to the following code:
delimiter=r"\d\s\s+"
But this separates my dataframe whenever it sees two blanks or more, whereas I strictly need the number before it followed by at least two blanks. Does anyone have an idea how to fix it?
My data looks as follows:
I am an example of a dataframe
I have Problems to get read
100,00
So How can I read it
20,00
so the first row should be:
I am an example of a dataframe I have Problems to get read 100,00
followed by the second row:
So How can I read it 20,00
I'd try it like this.
I'd manipulate the text file before attempting to parse it into a dataframe, as follows:
import pandas as pd
import re
f = open("whitespace.txt", "r")
g = f.read().replace("\n", " ")
prepared_text = re.sub(r'(\d+,\d+)', r'\1#', g)
df = pd.DataFrame({'My columns':prepared_text.split('#')})
print(df)
This gives the following:
My columns
0 I am an example of a dataframe I have Problems...
1 So How can I read it 20,00
2
I guess this'd suffice as long as the input file wasn't too large, but using the re module and substitution gives you the control you seek.
The (\d+,\d+) parentheses mark a group which we want to match. We're basically matching any of your numbers in your text file.
Then we use \1, which is called a backreference: it refers to the matched group when specifying a replacement. So each matched number is replaced by itself followed by #.
Then we use the inserted character as a delimiter.
There are some good examples here:
https://lzone.de/examples/Python%20re.sub
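As a variant of the same idea, the placeholder character can be avoided by splitting directly on whitespace that follows a number; a minimal sketch, assuming the numbers always end with two decimals as in the sample data:

import re
import pandas as pd

with open("whitespace.txt", "r") as f:
    text = f.read().replace("\n", " ")

# Split right after each "digits,dd" number; the fixed-width lookbehind keeps the number in the row
rows = [r.strip() for r in re.split(r'(?<=\d,\d\d)\s+', text) if r.strip()]
df = pd.DataFrame({'My columns': rows})
print(df)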
I want to find the 20 most common names, and their frequency, in a country.
Let's say I have lists of all residents' first names in 100 cities. Each list might contain a lot of names. Let's say we are talking about 100 lists, each with 1000 strings.
What is the most efficient method to get the 20 most common names, and their frequencies, in the entire country?
This is the direction I began with, assuming I have each city in a text file in the same directory:
Use the pandas and collections modules for this.
Iterate through each city.txt, making it a string. Then turn it into a collection using the Counter class, and then into a DataFrame (using to_dict).
Union each DataFrame with the previous one.
Then, group by and count (*) the DataFrame.
But, I'm thinking this method might not work, as the DataFrame can get too big.
Would like to hear any advice on that. Thank you.
Here is a sample code:
import os
from collections import Counter

cities = [i for i in os.listdir(".") if i.endswith(".txt")]

d = Counter()
for file in cities:
    with open(file) as f:
        # Adjust the code below to put the strings in a list
        data = f.read().split(",")
        d.update(Counter(data))

out = d.most_common(20)   # the 20 most common names with their counts
print(out)
You can also use the NLTK library; I was using the code below for a similar purpose.
from nltk import FreqDist

fd = FreqDist(text)           # text is an iterable of tokens, e.g. the list of names
top_20 = fd.most_common(20)   # it's done, you got the top 20 tokens :)
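A minimal sketch of how that text could be built from the same comma-separated city files used in the previous answer (the file layout is an assumption):

import os
from nltk import FreqDist

# Collect all names from every city file into one token list
# (assumes each *.txt file holds comma-separated names, as in the previous answer)
names = []
for city_file in (f for f in os.listdir(".") if f.endswith(".txt")):
    with open(city_file) as f:
        names.extend(name.strip() for name in f.read().split(","))

fd = FreqDist(names)
print(fd.most_common(20))   # the 20 most common names with their frequencies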