I am trying to read an excel file and write every fourth row into a new Excel file. I'm using Pandas to read and write, and if int(num%4) == 0 to determine which rows to select, but the iteration and subsequent writing continue to escape me. I've tried my best to look up answers, but I'm a new programmer and struggling :/
If you're using Pandas I'm assuming you've loaded the data into a dataframe?
If so then consider this:
import pandas as pd
df = pd.read_csv('YourFile.csv')
df.iloc[::4]
#once you're done with the data you can save it to another csv file
df.to_csv('OutputFile.csv')
This will leave your dataframe df with the 4th, 8th, 12th, etc. rows from your original dataframe/file. You can then read/write to each row left in the dataframe df. To visualize the before and after just insert df.head() before and after the df.iloc[::4] expression.
I did not understand what the problem is to be more specific, but you should try pandas' iloc property (or even loc depending on your df), check more info in here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
Related
I have a small dataframe that I would like to sort in Python. When sorting in Python, I get a different result then when sorting in Excel. I would like to sort from A to Z and have the pandas result match what is outputted in Excel.
Here is the code I used:
import pandas as pd
df = pd.DataFrame({"Col": ['0123A', '0123B', '01-AB']})
df = df.sort_values('Col', ascending=True)
Here is the output in python:
Col
2 01-AB
0 0123A
1 0123B
My output in excel is this:
Col
0 0123A
1 0123B
2 01-AB
Is there a reason why the pandas and excel result don't match?
Yes. There is a reason that the pandas sort and the Excel sort are different. The pandas sort is utilizing an ASCIII/UTF sorting hierarchy, which probably is what you expected; whereas, Excel treats the minus sign/hyphen differently in the sorting process. If you want the Excel spreadsheet to sort in the same manner as the pandas sort, you need to utilize some additional definitions and processing like what is explained if you connect to this link.
Excel Hyphen Sorting
You might need to read through the solution a few times for it to register, but sorting columns where cell contents can contain a hyphen requires a bit of extra work in Excel.
I hope that helps.
Regards.
I have a classic panda data frame made of ID and Text. I would like to get just one column and therefore i use the typical df["columnname"]. But at this point it becomes a Pandas Series. Is there a way to make a new dataframe with just that single column?
I'm asking this is because if I cast the Pandas series in a string (columnname = columnname.astype ("string")) and I save it in a text file, I see that it only saves the first sentence of each line and not the entire textual content, as I would like.
If there are any other solution, I'm open to learn :)
Try this: pd.DataFrame(dfname["columnname"])
I am still learning python, kindly excuse if the question looks trivial to some.
I have a csv file with following format and I want to extract a small segment of it and write to another csv file:
So, this is what I want to do:
Just extract the entries under actor_list2 and the corresponding id column and write it to a csv file in following format.
Since the format is not a regular column headers followed by some values, I am not sure how to select starting point based on a cell value in a particular column.e.g. even if we consider actor_list2, then it may have any number of entries under that. Please help me understand if it can be done using pandas dataframe processing capability.
Update: The reason why I would like to automate it is because there can be thousands of such files and it would be impractical to manually get that info to create the final csv file which will essentially have a row for each file.
As Nour-Allah has pointed out the formatting here is not very regular to say the least. The best you can do if that is the case that your data comes out like this every time is to skip some rows of the file:
import pandas as pd
df = pd.read_csv('blabla.csv', skiprows=list(range(17)), nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result but given how erratic formatting is, this is no way to automate. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
As the CSV file you have is not regular, so a lot of empty position, that contains 'nan' objects. Meanwhile, the columns will be indexed.
I will use pandas to read
import pandas as pd
df = pd.read_csv("not_regular_format.csv", header=None)
Then, initialize and empty dictionary to store the results in, and use it to build an output DataFram, which finally send its content to a CSV file
target={}
Now you need to find actor_list2 in the second columns which is the column with the index 0, and if it exists, start store the names and scores from in the next rows and columns 1 and 2 in the dictionary target
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
i = rows_index[0]
while True:
i += 1
name = df.iloc[i, 1]
score = df.iloc[i, 2]
if pd.isna(name): # the names sequence is finished and 'nan' object exists.
break
target[name] = [score]
and finally, construct DataFrame and write the new output.csv file
df_output=pd.DataFrame(target)
df_output.to_csv('output.csv')
Now, you can go anywhere with the given example above.
Good Luck
I know how to read the whole dataset, I know how to read a part of it, but it always reads all of the columns from my excel file. I do it like this:
myfile = pd.ExcelFile('my_file.xlsx')
myfile.parse(2, skiprows=14, skipfooter= 2).dropna(axis=1, how='all')
But I can not read only one specific cell this way, because it read the whole row. Is there a way to limit the parser to one column?
UPDATE:
looking for a Pandas solution
Update your pandas to 0.24.2:
Docs: read_excel, specifically read usecols
I believe you will need to use a combination of skiprows and skipfooter to narrow down to specific row and usecols to get the column. This way you will get the specific cells value.
I have an existing excel. That looks like
and I have another excel that has around 40000 rows and around 300 columns. shortened version looks like
I would like to append values to my existing excel from second excel. But only values that match values in col4 from my existing excel. So i would get something like this
Hope you guys get the picture of what I am trying to do.
yes, that is possible in pandas and it is way faster than anything in excel
df_result = pd.merge(FirstTable, SecondTable, how='left', on='col4')
this will look into both the tables for column "col4" so it needs to be named this way in both the tables.
Also be aware of the fact that if you have multiple values in second table for single value in the first table it will make as many lines in the result as in the second table.
to read the excel you can use:
import pandas as pd
xl=pd.ExcelFile('MyFile.xlsx')
FirstTable = pd.read_excel(xl, 'sheet_name_FIRST_TABLE')
for more detailed description see documentation