I have a classic pandas DataFrame made of ID and Text. I would like to get just one column, and therefore I use the typical df["columnname"]. But at this point it becomes a pandas Series. Is there a way to make a new DataFrame with just that single column?
I'm asking because if I cast the pandas Series to a string (columnname = columnname.astype("string")) and save it to a text file, I see that it only saves the first sentence of each line and not the entire text content, as I would like.
If there is any other solution, I'm open to learning :)
Try this: pd.DataFrame(dfname["columnname"])
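As a minimal sketch (the DataFrame and column names here are just assumptions), wrapping the selection in pd.DataFrame keeps it a one-column DataFrame, and to_csv then writes the full text of each row:
import pandas as pd

# hypothetical example data
df = pd.DataFrame({"ID": [1, 2],
                   "Text": ["first sentence. second sentence.", "more text here"]})

# wrap the Series in pd.DataFrame to get a one-column DataFrame back
text_df = pd.DataFrame(df["Text"])

# write it out; each row keeps its entire text content
text_df.to_csv("text_only.txt", index=False)
Selecting with double brackets, df[["Text"]], gives the same one-column DataFrame.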
I am trying to read an Excel file and write every fourth row into a new Excel file. I'm using Pandas to read and write, and if int(num % 4) == 0 to determine which rows to select, but the iteration and subsequent writing continue to escape me. I've tried my best to look up answers, but I'm a new programmer and struggling :/
If you're using Pandas I'm assuming you've loaded the data into a dataframe?
If so then consider this:
import pandas as pd
df = pd.read_csv('YourFile.csv')
df = df.iloc[3::4]  # keep every fourth row: the 4th, 8th, 12th, ...
# once you're done with the data you can save it to another csv file
df.to_csv('OutputFile.csv')
This will leave your dataframe df with the 4th, 8th, 12th, etc. rows from your original dataframe/file. You can then read/write each row left in the dataframe df. To visualize the before and after, just insert df.head() before and after the df.iloc[3::4] line.
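Since the original question is about Excel files rather than CSV, the same idea might look like this with read_excel/to_excel (the file names are assumptions, and an engine such as openpyxl needs to be installed):
import pandas as pd

# hypothetical file name; reads the first sheet of the workbook
df = pd.read_excel('YourFile.xlsx')

# keep the 4th, 8th, 12th, ... rows (positions 3, 7, 11 in zero-based terms)
df = df.iloc[3::4]

# write the selected rows to a new Excel file
df.to_excel('OutputFile.xlsx', index=False)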
I did not understand exactly what the problem is, to be more specific, but you should try pandas' iloc property (or even loc, depending on your df). You can find more info here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
I save my DataFrame as a CSV and try to open it in Excel; the problem is that Excel converts some of my float data to date format. I use Excel 2016.
This is how my DataFrame looks in Excel.
Does anyone have an idea how to stop this?
You have to select the required column, press Ctrl + 1, and then select the correct format. As you are saving the file as CSV, you have to repeat this every time you open the file, because CSV doesn't save such information and by default Excel reads everything in the General format. You can find more details here.
If you use Excel to open a CSV file it will attempt to interpret each cell. Something that resembles a date will be formatted as a date. Excel has the same behaviour if you type or paste something that resembles a date into a cell formatted as General.
However, if you paste the same data into a cell that has already been formatted other than General it will no longer be re-interpreted.
Format a blank Excel sheet as you expect the data to appear. Open the CSV file in a text editor such as Notepad. Copy the data then paste it into the Excel sheet.
If you aren't sure how the data should appear, for example because you aren't sure about the number of columns, you can format all of the cells as Text. That will suppress interpretation but you can change the formatting afterwards.
Incidentally, I discovered a bug in Excel that relates to this. When you add a new row to the bottom of a table it inherits the formatting of the row above, however Excel does this in the wrong order. To see this, format a table column as Text. In the row below the last row of the table, formatted General, type '1/1/2022'. Excel misinterprets this as 44562. That is because it interpreted 1/1/2022 as a date then changed the formatting to Text to match the row above.
Consequently, when applying the initial formatting you should select at least as many rows as in your CSV file. The easiest way to achieve this is simply to format entire columns.
In your particular case you probably want to pre-format certain columns as Number.
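If you would rather avoid the manual formatting step, one alternative (a sketch, not part of the answers above) is to skip the CSV round trip and write the DataFrame directly to an .xlsx file, where the cell types are stored explicitly:
import pandas as pd

# hypothetical DataFrame with values that Excel tends to mistake for dates
df = pd.DataFrame({'ratio': [1.2, 3.4, 5.6]})

# writing .xlsx instead of .csv stores typed cells, so Excel opens the
# numbers as numbers instead of reinterpreting plain text
df.to_excel('output.xlsx', index=False)  # requires openpyxl or xlsxwriter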
I am still learning Python, so kindly excuse me if the question looks trivial to some.
I have a CSV file in the following format, and I want to extract a small segment of it and write it to another CSV file:
So, this is what I want to do:
Just extract the entries under actor_list2 and the corresponding id column, and write them to a CSV file in the following format.
Since the format is not regular column headers followed by some values, I am not sure how to select a starting point based on a cell value in a particular column. For example, even if we consider actor_list2, it may have any number of entries under it. Please help me understand whether this can be done using pandas' DataFrame-processing capability.
Update: The reason I would like to automate this is that there can be thousands of such files, and it would be impractical to gather that info manually to create the final CSV file, which will essentially have one row per file.
As Nour-Allah has pointed out, the formatting here is not very regular, to say the least. If your data comes out like this every time, the best you can do is skip some rows of the file:
import pandas as pd
# skip the irregular header (first 17 rows) and read only the 8 rows of interest
df = pd.read_csv('blabla.csv', skiprows=list(range(17)), nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result, but given how erratic the formatting is, there is no way to automate this. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
Since the CSV file you have is not regular, there are a lot of empty positions, which will be read as NaN values. Meanwhile, the columns will be indexed by number.
I will use pandas to read it:
import pandas as pd
df = pd.read_csv("not_regular_format.csv", header=None)
Then, initialize an empty dictionary to store the results in; it will be used to build an output DataFrame, whose content is finally written to a CSV file:
target = {}
Now you need to find actor_list2 in the second column, which is the column with index 1. If it exists, start storing the names and scores from the following rows (columns 1 and 2) in the dictionary target:
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
    i = rows_index[0]
    while True:
        i += 1
        name = df.iloc[i, 1]
        score = df.iloc[i, 2]
        if pd.isna(name):  # the sequence of names is finished once a NaN appears
            break
        target[name] = [score]
and finally, construct a DataFrame and write the new output.csv file:
df_output = pd.DataFrame(target)
df_output.to_csv('output.csv')
Now you can take the example above anywhere you need.
Good luck
I know how to read the whole dataset and how to read a part of it, but it always reads all of the columns from my Excel file. I do it like this:
myfile = pd.ExcelFile('my_file.xlsx')
myfile.parse(2, skiprows=14, skipfooter=2).dropna(axis=1, how='all')
But I cannot read only one specific cell this way, because it reads the whole row. Is there a way to limit the parser to one column?
UPDATE:
looking for a Pandas solution
Update your pandas to 0.24.2:
Docs: read_excel; specifically, read about usecols.
I believe you will need to use a combination of skiprows and skipfooter to narrow down to the specific row, and usecols to get the column. This way you will get the specific cell's value.
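A minimal sketch of that combination (the sheet index, row counts, and column letter here are assumptions carried over from the question):
import pandas as pd

# skip the leading rows and the footer, and restrict parsing to a single
# Excel column (column C in this sketch)
cell_df = pd.read_excel('my_file.xlsx', sheet_name=2,
                        skiprows=14, skipfooter=2, usecols='C')

# the one specific cell is then the first value of that single column
value = cell_df.iloc[0, 0]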
I have an Excel file in which one row of the column Model has the value "9-3", which is a string. I double-checked the Excel file to make sure the column datatype is Plain string instead of Date. But still, when I use read_excel and convert it into a DataFrame, the value is shown as 2017-09-03 00:00:00 instead of the string "9-3".
Here is how I read the excel file:
table = pd.read_excel('ManualProfitAdjustmentUpdates.xlsx' , header=0, converters={'Model': str})
Any idea why pandas is not treating the value as a string even when I set the converter to str?
The Plain string setting in the Excel file affects only how the data is shown in Excel.
The str setting in the converter affects only how it treats the data that it gets.
To force the excel file to return the data as string, the cell's first character should be an apostrophe.
Change "9-3" to "'9-3".
The problem may be with Excel. Make sure the entire column is stored as text, and not just the single value you are talking about. If Excel had the column saved as a date at any point, it will store a year in that cell no matter what is shown or what the datatype is changed to. Pandas is going to read the entire column as one data type, so if you have dates above 9-3 it will be converted. Changing dates to strings without years can be tricky. It may be better to save the Excel sheet as a CSV once it is in the proper format you like and then use pandas' pd.read_csv(). I made a test Excel workbook "book1.xlsx":
9-3 1 Hello
12-1 2 World
1-8 3 Test
Then ran
import pandas as pd
df = pd.read_excel('book1.xlsx',header=0)
print(df)
and got back my data frame correctly. Thus, I am led to believe it is Excel. Sorry it isn't the best answer, but I don't believe it is a pandas error.
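If fixing the workbook itself isn't practical, another workaround (just a sketch, not part of the answers above, under the assumption that the stray values come back as datetimes) is to convert them back to "month-day" strings after reading, accepting that the year Excel invented is dropped:
import pandas as pd
from datetime import datetime

table = pd.read_excel('ManualProfitAdjustmentUpdates.xlsx', header=0)

# turn values Excel already converted to dates back into strings like "9-3";
# anything that was read as a plain string is left untouched
table['Model'] = table['Model'].apply(
    lambda v: f"{v.month}-{v.day}" if isinstance(v, datetime) else v)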