How to add text from documents in a folder to an array

How to add text from documents in a folder to an array - python

Good afternoon. Unfortunately, I did not find an answer to a simple question. I have a document folder. PDF format. I can use Pandas to open one document and add its text to an array. Where the first column is the folder name and the second is the text from the document. But how do you do this for all documents in a folder? Alas, I don't know.
category
text
test
first document
test
second document
test
...

Assuming you put the code you already have into a function that takes in a file name and the DataFrame you have so far, it's pretty easy to do what you want:
import os
import pandas as pd
dataframe = pd.DataFrame()
files = os.listdir("[path/to/folder/]")
for file in files:
dataframe = addFileToTable(file, dataFrame)
If you're not sure how to add a new row to the dataframe:
def addFileToTable(file, dataframe):
# Convert PDF to array
# ...
row = {"category" : array[0], "text" : array[1]}
df = dataframe.append(row, ignore_index = True)
return df

Related

How can I filter a csv file based on its columns in python?

I have a CSV file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645
I would like to write a code to pass the first row as the header and then extract data from two specific cities (Kish and Qeshm) and save it into a new CSV file. Somthing like this one:
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
It's worth mentioning that I'm very new to python.
I've written the following block to define the headers, but this is the furthest I've gotten so far.
import pandas as pd
path = '/Users/Desktop/sample.csv'
df = pd.read_csv(path , header=[0])
df.head = ()

You don't need to use header=... because the default is to treat the first row as the header, so
df = pd.read_csv(path)
Then, to keep rows on conditions:
df2 = df[df['City'].isin(['Kish', 'Qeshm'])]
And you can save it with
df2.to_csv(another_path)

how to read data (using pandas?) so that it is correctly formatted?

I have a txt file with following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved in one line. CSV file is not available.
I would like to have it as a data frame in pandas. when I use read.csv:
df = pd.read_csv('time-series-data.txt', sep = ",")
the output of print(df) is someting like [0 rows x 3455.. columns]
So currently everything is read in as one line. However, I would like to have 22 columns (time, activepower0, CosPhi0,..). I ask for tips, thank you very much.
Is a pandas dataframe even suitable for this? the text files are up to 2 GB in size.

Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the python code used to read it in:
import json
import pandas as pd
# Read test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f
data = json.load(f)
# Get the first element of results list, and first element of series list
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']
df = pd.DataFrame(values, columns=columns)
print(df)

Reading, calculate and group data of several files with pandas

I'm trying to make a small script to automate something at my work. I have a ton of text files that I need to group into a large dataframe to plot after.
The files have this general structure like this
5.013130280 4258.0
5.039390845 4198.0
... ...
49.944957015 858.0
49.971217580 833.0
What I want to do is
Keep the first column as the column of the final dataframe (as these values are the same for all files)
The rest of the dataframe is just extracting the second column of each file, normalize it and group everything together.
Use the file name as the header for extracted column (from point to) to use after in the plotting of the data
Right I was able to only make step 2, here is the code
import os
import pandas as pd
import glob
path = "mypath"
extension = 'xy'
os.chdir(path)
dir = os.listdir(path)
files = glob.glob(path + "/*.xy")
li = []
for file in files:
df = pd.read_csv(file, names=('angle','int'), delim_whitespace=True)
df['int_n']=data['int']/data['int'].max()
li_norm.append(df['int_n'])
norm_files = pd.concat(li_norm, axis = 1)
So is there any way to solve this in an easy way?

Assuming that all of your files have exactly the same length (# of rows) and values for angles, then you don't really need to make a bunch of dataframes and concatenate them all together.
If I'm understanding correctly, you just want a final dataframe with a new column for each file (named with the filename) with the 'int' data, normalized with all the values from only that specific file
On the first file, you can create a dataframe to use as your final output, then just add columns to it on each subsequent file
for idx,file in enumerate(files):
df = pd.read_csv(file, names=('angle','int'), delim_whitespace=True)
filename = file.split('\\')[-1][:-3] #get filename from splitting full path and removing last 3 characters (file extension)
df[filename]=df['int']/df['int'].max() #use the filename itself as the new column name
if idx == 0: #create norm_files output dataframe on first file
norm_files = df[['angle',file]]
else: #add column to norm_files for all subsequent files
norm_files[file] = df[file]

You can add a calculated column quite simply, although I'm not sure if that's what you're asking.
for file in files:
df = pd.read_csv(file, names=('angle','int'), delim_whitespace=True)
df[file.split('.')[0]]=data['int']/data['int'].max()
li_norm.append(df['int_n'])

How do I loop through a directory of PDF files and write the information to a Pandas Dataframe in Python?

I am very new to Python, Pandas, and NLP but have taken a few intro courses. I have a directory of 3 PDF files (will be over a hundred once I get the full data set). I want to open each file and make two columns in a Pandas dataframe that I can eventually use for some NLP work. The two columns needed are an ID column with the name of the PDF and the second column is just all of the text/information located within that PDF
This is the code I used to go through one file at a time:
import PyPDF2 as pdf
i = 0
while i < pdf_reader.getNumPages():
pageinfo = pdf_reader.getPage(i)
print(pageinfo.extractText())
i = i + 1
This is the code that I used to name my directory and print out the file names:
import os
directory = os.listdir('test_files/')
directory = os.listdir('test_files/')
for entry in directory:
print(entry)
**Update, this is what I have so far. Does it seem close?
directory = os.listdir('test_files/')
for entry in directory:
file = open(entry,'rb')
pdf_reader = pdf.PdfFileReader(file)
i = 0
while i < pdf_reader.getNumPages():
pageinfo = pdf_reader.getPage(i)
i = i + 1
data = {'PDF_ID':[entry],
'Text_Data': [pageinfo.extractText()]}
df = pd.DataFrame(data, columns = ['PDF_ID','Text_Data'])
would be ideal, but I haven't found the best way combine them and create a dataframe at the same time. I already have a function created that will clean and tokenize the text, but one file at a time is not ideal. Thanks!

how to save and then extract some information from the file names in dataframe

I have almost 1000000 or even more files in a path.
My final goal is to extract some information from just names of the files.
Till now I have saved the names of the file in a list.
what information in names of the files?
so the format of the names of the file is something like this:
09066271_142468576_1_Haha_-Haha-haha_2016-10-07_haha-false_haha2427.txt
all haha are other different text that does not matter.
I want to extract 09066271 and 2016-10-07 out of the names and save in a dataframe. the first number is always 8 character.
Till now , I have saved the whole text file names in the list:
path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)
firstly I wanted to save the whole txt file names in the dataframe and then do these operations on them. it seems I have to firstly read to numpy then reshape it to be readable in pandas. however I do not now before what will be the reshape numbers.
df = pd.DataFrame(np.array(file_list).reshape(,))
I would appreciate if you can give me your idea and what will be the efficient way of doing this :)

You can use os to list all of the files. Then just construct a DataFrame and use the string methods to get the parts of the filenames you need.
import pandas as pd
import os
path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)
df = pd.DataFrame(file_list, columns=['file_name'])
df['data'] = df.file_name.str[0:8]
df['date'] = df.file_name.str.extract('(\d{4}-\d{2}-\d{2})', expand=True)
file_name data date
0 09066271_142468576_1_Haha_-Haha-haha_2016-10-0... 09066271 2016-10-07
1 09014271_142468576_1_Haha_-Haha-haha_2013-02-1... 09014271 2013-02-18

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to add text from documents in a folder to an array - python

Related

How can I filter a csv file based on its columns in python?

how to read data (using pandas?) so that it is correctly formatted?

Reading, calculate and group data of several files with pandas

How do I loop through a directory of PDF files and write the information to a Pandas Dataframe in Python?

how to save and then extract some information from the file names in dataframe

Categories

Resources