How to skip remainder of dataframe after blank line?

The following routine retrieves a data file.

wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")

df = pd.read_excel("C:/temp/sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')

The first three data lines are incomplete so I can slice them off like this df[3:]

At about line 1640 there is a blank line. I wish to skip the rest of the file after that line. I tried to find that line like so and get its index so I could do another slice, but I get nan for the index value.

df[df.isnull().all(1)].index.values[0]

How can I find that line and skip the rest of the file?

asked Nov 12 '18 at 23:25

John

1289

1

Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

– Peter Leimbigler
Nov 12 '18 at 23:31

Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

– John
Nov 13 '18 at 0:17

Maybe df[df.isnull().all(1)].iloc[0].index?

– Peter Leimbigler
Nov 13 '18 at 0:23

@PeterLeimbigler It can be done without having to load the data.

– coldspeed
Nov 13 '18 at 0:45

add a comment |

The following routine retrieves a data file.

wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")

df = pd.read_excel("C:/temp/sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')

The first three data lines are incomplete so I can slice them off like this df[3:]

df[df.isnull().all(1)].index.values[0]

How can I find that line and skip the rest of the file?

asked Nov 12 '18 at 23:25

John

1289

1

Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

– Peter Leimbigler
Nov 12 '18 at 23:31

Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

– John
Nov 13 '18 at 0:17

Maybe df[df.isnull().all(1)].iloc[0].index?

– Peter Leimbigler
Nov 13 '18 at 0:23

@PeterLeimbigler It can be done without having to load the data.

– coldspeed
Nov 13 '18 at 0:45

add a comment |

The following routine retrieves a data file.

wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")

df = pd.read_excel("C:/temp/sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')

The first three data lines are incomplete so I can slice them off like this df[3:]

df[df.isnull().all(1)].index.values[0]

How can I find that line and skip the rest of the file?

asked Nov 12 '18 at 23:25

John

1289

The following routine retrieves a data file.

wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")

df = pd.read_excel("C:/temp/sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')

The first three data lines are incomplete so I can slice them off like this df[3:]

df[df.isnull().all(1)].index.values[0]

How can I find that line and skip the rest of the file?

python pandas

asked Nov 12 '18 at 23:25

John

1289

asked Nov 12 '18 at 23:25

John

1289

asked Nov 12 '18 at 23:25

John

1289

asked Nov 12 '18 at 23:25

John

1289

asked Nov 12 '18 at 23:25

John

1289

1

Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

– Peter Leimbigler
Nov 12 '18 at 23:31

Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

– John
Nov 13 '18 at 0:17

Maybe df[df.isnull().all(1)].iloc[0].index?

– Peter Leimbigler
Nov 13 '18 at 0:23

@PeterLeimbigler It can be done without having to load the data.

– coldspeed
Nov 13 '18 at 0:45

add a comment |

1

Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

– Peter Leimbigler
Nov 12 '18 at 23:31

Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

– John
Nov 13 '18 at 0:17

Maybe df[df.isnull().all(1)].iloc[0].index?

– Peter Leimbigler
Nov 13 '18 at 0:23

@PeterLeimbigler It can be done without having to load the data.

– coldspeed
Nov 13 '18 at 0:45

Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

– Peter Leimbigler
Nov 12 '18 at 23:31

Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

– John
Nov 13 '18 at 0:17

Maybe df[df.isnull().all(1)].iloc[0].index?

– Peter Leimbigler
Nov 13 '18 at 0:23

@PeterLeimbigler It can be done without having to load the data.

– coldspeed
Nov 13 '18 at 0:45

add a comment |

1 Answer
1

active

oldest

votes

I think you have two nan-row problems in this file:

The first row after the header is already an empty row leading to a nan index.

The reason for your post here, the empty row which indicates the end of the data you're interested in.

first import the data as you did it:

df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')



df.head()

                     Bullish  Neutral  Bearish   ...      High     Low   Close

Date                                             ...                          

NaN                      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-06-26 00:00:00      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-07-17 00:00:00      NaN      NaN      NaN   ...    314.59  307.63  314.59

1987-07-24 00:00:00     0.36     0.50     0.14   ...    311.39  307.81  309.27

1987-07-31 00:00:00     0.26     0.48     0.26   ...    318.66  310.65  318.66

then remove first empty row (nan-index), problem No1:

df = df[1:]



df.head()

                     Bullish  Neutral  Bearish   ...      High     Low   Close

Date                                             ...                          

1987-06-26 00:00:00      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-07-17 00:00:00      NaN      NaN      NaN   ...    314.59  307.63  314.59

1987-07-24 00:00:00     0.36     0.50     0.14   ...    311.39  307.81  309.27

1987-07-31 00:00:00     0.26     0.48     0.26   ...    318.66  310.65  318.66

1987-08-07 00:00:00     0.56     0.15     0.29   ...    323.00  316.23  323.00

And now you want to index all rows before the first nan-index, problem No2.

Idea: create a boolean array with True entries for all nan- indices, cast to integer and build the cumulative sum. Now you have an array, which is 0 for all the data you want and >0 from any unwanted line on until the end.

This tested against 0 returns a boolean index for your data:

data_idx = df.index.isna().astype(int).cumsum() == 0

Applied to your dataframe:

df[data_idx]

                      Bullish   Neutral   ...         Low    Close

Date                                      ...                     

1987-06-26 00:00:00       NaN       NaN   ...         NaN      NaN

1987-07-17 00:00:00       NaN       NaN   ...      307.63   314.59

1987-07-24 00:00:00  0.360000  0.500000   ...      307.81   309.27

1987-07-31 00:00:00  0.260000  0.480000   ...      310.65   318.66

1987-08-07 00:00:00  0.560000  0.150000   ...      316.23   323.00

                      ...       ...   ...         ...      ...

2018-10-11 00:00:00  0.306061  0.339394   ...     2784.86  2785.68

2018-10-18 00:00:00  0.339350  0.310469   ...     2710.51  2809.21

2018-10-25 00:00:00  0.279693  0.310345   ...     2651.89  2656.10

2018-11-01 00:00:00  0.379310  0.275862   ...     2603.54  2711.74

2018-11-08 00:00:00  0.412844  0.275229   ...     2700.44  2813.89



[1635 rows x 12 columns]

edited Nov 14 '18 at 21:57

answered Nov 14 '18 at 21:45

SpghttCd

4,2622313

That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

– John
Nov 15 '18 at 18:51

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53271560%2fhow-to-skip-remainder-of-dataframe-after-blank-line%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

I think you have two nan-row problems in this file:

The first row after the header is already an empty row leading to a nan index.

The reason for your post here, the empty row which indicates the end of the data you're interested in.

first import the data as you did it:

df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')



df.head()

                     Bullish  Neutral  Bearish   ...      High     Low   Close

Date                                             ...                          

NaN                      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-06-26 00:00:00      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-07-17 00:00:00      NaN      NaN      NaN   ...    314.59  307.63  314.59

1987-07-24 00:00:00     0.36     0.50     0.14   ...    311.39  307.81  309.27

1987-07-31 00:00:00     0.26     0.48     0.26   ...    318.66  310.65  318.66

then remove first empty row (nan-index), problem No1:

df = df[1:]



df.head()

                     Bullish  Neutral  Bearish   ...      High     Low   Close

Date                                             ...                          

1987-06-26 00:00:00      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-07-17 00:00:00      NaN      NaN      NaN   ...    314.59  307.63  314.59

1987-07-24 00:00:00     0.36     0.50     0.14   ...    311.39  307.81  309.27

1987-07-31 00:00:00     0.26     0.48     0.26   ...    318.66  310.65  318.66

1987-08-07 00:00:00     0.56     0.15     0.29   ...    323.00  316.23  323.00

data_idx = df.index.isna().astype(int).cumsum() == 0

Applied to your dataframe:

df[data_idx]

                      Bullish   Neutral   ...         Low    Close

Date                                      ...                     

1987-06-26 00:00:00       NaN       NaN   ...         NaN      NaN

1987-07-17 00:00:00       NaN       NaN   ...      307.63   314.59

1987-07-24 00:00:00  0.360000  0.500000   ...      307.81   309.27

1987-07-31 00:00:00  0.260000  0.480000   ...      310.65   318.66

1987-08-07 00:00:00  0.560000  0.150000   ...      316.23   323.00

                      ...       ...   ...         ...      ...

2018-10-11 00:00:00  0.306061  0.339394   ...     2784.86  2785.68

2018-10-18 00:00:00  0.339350  0.310469   ...     2710.51  2809.21

2018-10-25 00:00:00  0.279693  0.310345   ...     2651.89  2656.10

2018-11-01 00:00:00  0.379310  0.275862   ...     2603.54  2711.74

2018-11-08 00:00:00  0.412844  0.275229   ...     2700.44  2813.89



[1635 rows x 12 columns]

edited Nov 14 '18 at 21:57

answered Nov 14 '18 at 21:45

SpghttCd

4,2622313

That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

– John
Nov 15 '18 at 18:51

add a comment |

I think you have two nan-row problems in this file:

The first row after the header is already an empty row leading to a nan index.

The reason for your post here, the empty row which indicates the end of the data you're interested in.

first import the data as you did it:

df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')



df.head()

                     Bullish  Neutral  Bearish   ...      High     Low   Close

Date                                             ...                          

NaN                      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-06-26 00:00:00      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-07-17 00:00:00      NaN      NaN      NaN   ...    314.59  307.63  314.59

1987-07-24 00:00:00     0.36     0.50     0.14   ...    311.39  307.81  309.27

1987-07-31 00:00:00     0.26     0.48     0.26   ...    318.66  310.65  318.66

then remove first empty row (nan-index), problem No1:

df = df[1:]



df.head()

                     Bullish  Neutral  Bearish   ...      High     Low   Close

Date                                             ...                          

1987-06-26 00:00:00      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-07-17 00:00:00      NaN      NaN      NaN   ...    314.59  307.63  314.59

1987-07-24 00:00:00     0.36     0.50     0.14   ...    311.39  307.81  309.27

1987-07-31 00:00:00     0.26     0.48     0.26   ...    318.66  310.65  318.66

1987-08-07 00:00:00     0.56     0.15     0.29   ...    323.00  316.23  323.00

data_idx = df.index.isna().astype(int).cumsum() == 0

Applied to your dataframe:

df[data_idx]

                      Bullish   Neutral   ...         Low    Close

Date                                      ...                     

1987-06-26 00:00:00       NaN       NaN   ...         NaN      NaN

1987-07-17 00:00:00       NaN       NaN   ...      307.63   314.59

1987-07-24 00:00:00  0.360000  0.500000   ...      307.81   309.27

1987-07-31 00:00:00  0.260000  0.480000   ...      310.65   318.66

1987-08-07 00:00:00  0.560000  0.150000   ...      316.23   323.00

                      ...       ...   ...         ...      ...

2018-10-11 00:00:00  0.306061  0.339394   ...     2784.86  2785.68

2018-10-18 00:00:00  0.339350  0.310469   ...     2710.51  2809.21

2018-10-25 00:00:00  0.279693  0.310345   ...     2651.89  2656.10

2018-11-01 00:00:00  0.379310  0.275862   ...     2603.54  2711.74

2018-11-08 00:00:00  0.412844  0.275229   ...     2700.44  2813.89



[1635 rows x 12 columns]

edited Nov 14 '18 at 21:57

answered Nov 14 '18 at 21:45

SpghttCd

4,2622313

That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

– John
Nov 15 '18 at 18:51

add a comment |

I think you have two nan-row problems in this file:

The first row after the header is already an empty row leading to a nan index.

The reason for your post here, the empty row which indicates the end of the data you're interested in.

first import the data as you did it:

df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')



df.head()

                     Bullish  Neutral  Bearish   ...      High     Low   Close

Date                                             ...                          

NaN                      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-06-26 00:00:00      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-07-17 00:00:00      NaN      NaN      NaN   ...    314.59  307.63  314.59

1987-07-24 00:00:00     0.36     0.50     0.14   ...    311.39  307.81  309.27

1987-07-31 00:00:00     0.26     0.48     0.26   ...    318.66  310.65  318.66

then remove first empty row (nan-index), problem No1:

df = df[1:]



df.head()

                     Bullish  Neutral  Bearish   ...      High     Low   Close

Date                                             ...                          

1987-06-26 00:00:00      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-07-17 00:00:00      NaN      NaN      NaN   ...    314.59  307.63  314.59

1987-07-24 00:00:00     0.36     0.50     0.14   ...    311.39  307.81  309.27

1987-07-31 00:00:00     0.26     0.48     0.26   ...    318.66  310.65  318.66

1987-08-07 00:00:00     0.56     0.15     0.29   ...    323.00  316.23  323.00

data_idx = df.index.isna().astype(int).cumsum() == 0

Applied to your dataframe:

df[data_idx]

                      Bullish   Neutral   ...         Low    Close

Date                                      ...                     

1987-06-26 00:00:00       NaN       NaN   ...         NaN      NaN

1987-07-17 00:00:00       NaN       NaN   ...      307.63   314.59

1987-07-24 00:00:00  0.360000  0.500000   ...      307.81   309.27

1987-07-31 00:00:00  0.260000  0.480000   ...      310.65   318.66

1987-08-07 00:00:00  0.560000  0.150000   ...      316.23   323.00

                      ...       ...   ...         ...      ...

2018-10-11 00:00:00  0.306061  0.339394   ...     2784.86  2785.68

2018-10-18 00:00:00  0.339350  0.310469   ...     2710.51  2809.21

2018-10-25 00:00:00  0.279693  0.310345   ...     2651.89  2656.10

2018-11-01 00:00:00  0.379310  0.275862   ...     2603.54  2711.74

2018-11-08 00:00:00  0.412844  0.275229   ...     2700.44  2813.89



[1635 rows x 12 columns]

edited Nov 14 '18 at 21:57

answered Nov 14 '18 at 21:45

SpghttCd

4,2622313

I think you have two nan-row problems in this file:

The first row after the header is already an empty row leading to a nan index.

The reason for your post here, the empty row which indicates the end of the data you're interested in.

first import the data as you did it:

df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')



df.head()

                     Bullish  Neutral  Bearish   ...      High     Low   Close

Date                                             ...                          

NaN                      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-06-26 00:00:00      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-07-17 00:00:00      NaN      NaN      NaN   ...    314.59  307.63  314.59

1987-07-24 00:00:00     0.36     0.50     0.14   ...    311.39  307.81  309.27

1987-07-31 00:00:00     0.26     0.48     0.26   ...    318.66  310.65  318.66

then remove first empty row (nan-index), problem No1:

df = df[1:]



df.head()

                     Bullish  Neutral  Bearish   ...      High     Low   Close

Date                                             ...                          

1987-06-26 00:00:00      NaN      NaN      NaN   ...       NaN     NaN     NaN

1987-07-17 00:00:00      NaN      NaN      NaN   ...    314.59  307.63  314.59

1987-07-24 00:00:00     0.36     0.50     0.14   ...    311.39  307.81  309.27

1987-07-31 00:00:00     0.26     0.48     0.26   ...    318.66  310.65  318.66

1987-08-07 00:00:00     0.56     0.15     0.29   ...    323.00  316.23  323.00

data_idx = df.index.isna().astype(int).cumsum() == 0

Applied to your dataframe:

df[data_idx]

                      Bullish   Neutral   ...         Low    Close

Date                                      ...                     

1987-06-26 00:00:00       NaN       NaN   ...         NaN      NaN

1987-07-17 00:00:00       NaN       NaN   ...      307.63   314.59

1987-07-24 00:00:00  0.360000  0.500000   ...      307.81   309.27

1987-07-31 00:00:00  0.260000  0.480000   ...      310.65   318.66

1987-08-07 00:00:00  0.560000  0.150000   ...      316.23   323.00

                      ...       ...   ...         ...      ...

2018-10-11 00:00:00  0.306061  0.339394   ...     2784.86  2785.68

2018-10-18 00:00:00  0.339350  0.310469   ...     2710.51  2809.21

2018-10-25 00:00:00  0.279693  0.310345   ...     2651.89  2656.10

2018-11-01 00:00:00  0.379310  0.275862   ...     2603.54  2711.74

2018-11-08 00:00:00  0.412844  0.275229   ...     2700.44  2813.89



[1635 rows x 12 columns]

edited Nov 14 '18 at 21:57

answered Nov 14 '18 at 21:45

SpghttCd

4,2622313

edited Nov 14 '18 at 21:57

answered Nov 14 '18 at 21:45

SpghttCd

4,2622313

answered Nov 14 '18 at 21:45

SpghttCd

4,2622313

answered Nov 14 '18 at 21:45

SpghttCd

4,2622313

That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

– John
Nov 15 '18 at 18:51

add a comment |

That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

– John
Nov 15 '18 at 18:51

That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

– John
Nov 15 '18 at 18:51

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Nrthugu