Can't store txt file data in Python Dataframe [closed] - python

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 12 months ago.
Improve this question
I am following an article about an image-caption transformer model in TensorFlow (Python). When I run the following code, the head() call shows no data.
file = open(dir_Flickr_text, 'r')
text = file.read()
file.close()

datatxt = []
for line in text.split('\n'):
    col = line.split('\t')
    if len(col) == 1:
        continue
    w = col[0].split("#")
    datatxt.append(w + [col[1].lower()])

data = pd.DataFrame(datatxt, columns=["filename", "index", "caption"])
data = data.reindex(columns=['index', 'filename', 'caption'])
data = data[data.filename != '2258277193_586949ec62.jpg.1']
uni_filenames = np.unique(data.filename.values)
data.head()
After running this I see the three columns (index, filename, caption) but no data at all, while the real file contains plenty of data and the article displays it too.

It doesn't show any data because the DataFrame is empty, probably because datatxt is empty. Try adding a print() statement before data = pd.DataFrame(... to see what is going on.
It is hard for us to debug without the dataset.
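If the loop appends nothing, the DataFrame will indeed be empty. A minimal, self-contained sketch of what a non-empty run looks like, using two hypothetical sample lines in the Flickr8k token-file layout the article relies on ("&lt;filename&gt;#&lt;index&gt;\t&lt;caption&gt;"), and where to put the print():

```python
import pandas as pd

# Hypothetical sample lines mimicking the Flickr8k token-file format.
sample = (
    "1000268201_693b08cb0e.jpg#0\tA child in a pink dress .\n"
    "1000268201_693b08cb0e.jpg#1\tA girl going into a wooden building .\n"
)

datatxt = []
for line in sample.split('\n'):
    col = line.split('\t')
    if len(col) == 1:        # blank or malformed line: no tab found
        continue
    w = col[0].split('#')    # -> [filename, caption index]
    datatxt.append(w + [col[1].lower()])

print(len(datatxt))          # if this prints 0 on your real file, the
                             # delimiters ('\t', '#') don't match its format
data = pd.DataFrame(datatxt, columns=["filename", "index", "caption"])
```

If len(datatxt) is zero on the real file, the lines are probably not tab-separated, so len(col) == 1 skips every one of them.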

Related

How to validate csv file header using existing schema csv info file [closed]

Closed 7 days ago.
I am trying to validate the header column names of input.csv using an existing schema_info.csv file.
input.csv
emp_id,emp_name,salary
1,siva,1000
2,ravi,200
3,kiran,800
schema_info.csv
file_name,column_name,column_sequence
input.csv,EMP_ID,1
input.csv,EMP_NAME,2
input.csv,SALARY,3
I try to read the header of input.csv and compare its column names and their sequence with the schema_info data, but I am unable to get the column order from the input file header or to compare it with the schema file data. Any suggestions?
input = sc.textFile("examples/src/main/resources/people.txt")
header = input.first()  # first() returns a plain string, not an RDD, so it has no map()
# Build (column_name, column_sequence) pairs from the header line.
header_data = [(name.strip(), i + 1) for i, name in enumerate(header.split(","))]
schema_info = spark.read.option("header", "true").option("inferSchema", "true").csv("/schema_info.csv")
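Once the header names and their positions are in plain Python, the comparison itself does not need Spark. A minimal sketch, assuming the schema_info rows have been collected as (column_name, column_sequence) pairs for this file and that the match should be case-insensitive (the question's two files differ only in case):

```python
def validate_header(input_header, schema_rows):
    """Return True if input_header matches schema_rows by name and order.

    schema_rows: list of (column_name, column_sequence) tuples for one file.
    """
    # Order the expected names by their declared sequence number.
    expected = [name.lower() for name, _ in
                sorted(schema_rows, key=lambda r: int(r[1]))]
    actual = [c.strip().lower() for c in input_header]
    return expected == actual

# Values taken from the question's input.csv and schema_info.csv:
schema_rows = [("EMP_ID", "1"), ("EMP_NAME", "2"), ("SALARY", "3")]
print(validate_header(["emp_id", "emp_name", "salary"], schema_rows))  # True
print(validate_header(["emp_name", "emp_id", "salary"], schema_rows))  # False
```

Sorting by column_sequence first means the check catches both wrong names and right names in the wrong order.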

Filter csv file without using panda or csv module PYTHON [closed]

Closed 2 years ago.
Without using any modules (pandas, csv, etc.) I need to filter this csv file (https://data.world/prasert/rotten-tomatoes-top-movies-by-genre/workspace/file?filename=rotten_tomatoes_top_movies_2019-01-15.csv): I would like to keep only the movies in the animation genre and drop the others.
I have used open, split and a for loop to read the data, but I am struggling to filter the movies by genre.
I created a list called genres and appended to it with genres.append(line.split(",")[4]), but this only gives a list of values from the genre column rather than the full row of each movie in a particular genre.
I know it is crazy to attempt this without the modules (this is for school), but is it even possible to do without them?
Thanks in advance.
Try this:
f = open("file_name", "r", encoding="utf-8")
new_list = []
header = 0
for line in f.readlines():
    # keep the header row if the file has one
    if header == 0:
        new_list.append(line)
        header = 1
        continue
    # replace 'genre_name' with the genre you want to keep, e.g. 'Animation'
    if line.split(',')[4] == 'genre_name':
        new_list.append(line)
f.close()

# write the filtered list to an output file
out_file = open('output.txt', 'w', encoding="utf-8")
for element in new_list:
    out_file.write(element)  # lines from readlines() already end in '\n'
out_file.close()
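One caveat with line.split(','): it breaks as soon as a field contains a comma inside quotes, which cast lists in this dataset can. If that turns out to matter, a small hand-rolled splitter, still module-free, could look like this (a simplified sketch with no escaped-quote handling):

```python
def split_csv_line(line):
    """Split one CSV line on commas, honouring double quotes,
    without the csv module."""
    fields, current, in_quotes = [], "", False
    for ch in line.rstrip("\n"):
        if ch == '"':
            in_quotes = not in_quotes      # toggle quoted state, drop the quote
        elif ch == ',' and not in_quotes:
            fields.append(current)         # field boundary outside quotes
            current = ""
        else:
            current += ch
    fields.append(current)                 # last field has no trailing comma
    return fields

print(split_csv_line('Up,2009,"Pete Docter, Bob Peterson",Animation\n'))
# ['Up', '2009', 'Pete Docter, Bob Peterson', 'Animation']
```

With this, line.split(',')[4] becomes split_csv_line(line)[4] and the genre index stays stable even when an earlier field contains a comma.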

Python pandas read_excel missing rows [closed]

Closed 1 year ago.
I use pandas to read a lot of datasets from Bloomberg.
When I tested the reading program I noticed that pandas wasn't reading all the rows; it skipped some.
The code is the following:
import pandas as pnd  # the question's alias for pandas

def data_read(data_files):
    data = {}
    # Read all data and add it to a dictionary: filename -> content
    for file in data_files:
        file_key = file.split('/')[-1][:-5]  # file name without path or '.xlsx'
        data[file_key] = {}
        # For each sheet (data_to_take is defined elsewhere): sheet -> data
        for sheet_key in data_to_take:
            data[file_key][sheet_key] = pnd.read_excel(file, sheet_name=sheet_key)
    return data
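It is hard to say more without the files themselves, but one quick check is to rule read_excel itself out: write a sheet with a known number of rows, read it back, and compare. A minimal round-trip sketch (the file name check.xlsx and the sample values are hypothetical; writing .xlsx requires an Excel engine such as openpyxl):

```python
import pandas as pd

# Write a sheet with a known row count, then read it back.
df_out = pd.DataFrame({"ticker": ["AAPL", "MSFT", "GOOG"],
                       "px": [1.0, 2.0, 3.0]})
df_out.to_excel("check.xlsx", sheet_name="Sheet1", index=False)

df_in = pd.read_excel("check.xlsx", sheet_name="Sheet1")
print(df_in.shape[0])  # 3 here; if your real files come back short, inspect
                       # them for merged cells or extra rows above the header
```

If the round trip preserves every row, the missing rows are more likely a property of the source workbooks (merged cells, hidden sheets, a header row that isn't row 1) than of pandas.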

How to Avoid Duplicate Data [closed]

Closed 3 years ago.
while True:
    if bbs_number > lately_number():
        sys.stdout = open('date.txt', 'a')  # redirect print() output to date.txt
        bbs_lists = range(highest_number() + 1, bbs_number + 1)
        for item in bbs_lists:
            url_number = "url" + str(item)
            try:
                result = requests.get(url_number)
                bs_number = BeautifulSoup(result.content, "lxml")
                float_box = bs_number.find("div", {"class": "float_box"})
                parameter_script = float_box
                print("bs_obj()")
            except AttributeError as e:
                print("error")
        with open('lately_number.txt', 'w') as f_last:
            f_last.write(str(bbs_number))
The while loop above does not raise an error, but duplicate data is written to date.txt.
I want to fix this early, when the range value is set, rather than removing duplicates later after they have been written to date.txt.
One possibility is that lately_number() produces a duplicate range for date.txt, because sometimes the value is not written correctly while lately_number.txt is being updated.
I would be grateful for help with a better function to add or replace.
The simplest way would be to read date.txt into a set. Then you can check the set to see if the date is already there, and if it isn't, write it to the date.txt file.
E.g.:
uniqueDates = set()

# read the file contents into a set
with open("date.txt", "r") as f:
    for line in f:
        uniqueDates.add(line.strip())  # strip off the line ending '\n'

# ensure what we're writing to the date file isn't a duplicate
with open("date.txt", "a") as f:
    if "bs_obj()" not in uniqueDates:
        f.write("bs_obj()\n")
You'll probably need to adjust the logic a bit to fit your needs, but I believe this is what you're trying to accomplish.

how to fetch only content from table by avoiding unwanted codes [closed]

Closed 3 years ago.
I was trying to fetch text content from a table, which works well, but along with the result it prints unwanted code.
My code is:
searchitem = searchme.objects.filter(face=after).values_list("tale", flat=True)
The contents are text.
The result I receive is a QuerySet wrapping "Prabhakaran",
but I only want the result "Prabhakaran".
model is this
class searchme(models.Model):
    face = models.TextField()
    tale = models.TextField()
From the official Django documentation:
A common need is to get a specific field value of a certain model instance. To achieve that, use values_list() followed by a get() call:
So use:
searchme.objects.values_list('tale', flat=True).get(face=after)
