How to skip remainder of dataframe after blank line?












0















The following routine retrieves a data file.



wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")
df = pd.read_excel("C:/temp/sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')


The first three data lines are incomplete so I can slice them off like this df[3:]



At about line 1640 there is a blank line. I wish to skip the rest of the file after that line. I tried to find that line like so and get its index so I could do another slice, but I get nan for the index value.



df[df.isnull().all(1)].index.values[0]


How can I find that line and skip the rest of the file?










share|improve this question


















  • 1





    Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

    – Peter Leimbigler
    Nov 12 '18 at 23:31











  • Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

    – John
    Nov 13 '18 at 0:17











  • Maybe df[df.isnull().all(1)].iloc[0].index?

    – Peter Leimbigler
    Nov 13 '18 at 0:23











  • @PeterLeimbigler It can be done without having to load the data.

    – coldspeed
    Nov 13 '18 at 0:45
















0















The following routine retrieves a data file.



wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")
df = pd.read_excel("C:/temp/sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')


The first three data lines are incomplete so I can slice them off like this df[3:]



At about line 1640 there is a blank line. I wish to skip the rest of the file after that line. I tried to find that line like so and get its index so I could do another slice, but I get nan for the index value.



df[df.isnull().all(1)].index.values[0]


How can I find that line and skip the rest of the file?










share|improve this question


















  • 1





    Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

    – Peter Leimbigler
    Nov 12 '18 at 23:31











  • Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

    – John
    Nov 13 '18 at 0:17











  • Maybe df[df.isnull().all(1)].iloc[0].index?

    – Peter Leimbigler
    Nov 13 '18 at 0:23











  • @PeterLeimbigler It can be done without having to load the data.

    – coldspeed
    Nov 13 '18 at 0:45














0












0








0








The following routine retrieves a data file.



wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")
df = pd.read_excel("C:/temp/sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')


The first three data lines are incomplete so I can slice them off like this df[3:]



At about line 1640 there is a blank line. I wish to skip the rest of the file after that line. I tried to find that line like so and get its index so I could do another slice, but I get nan for the index value.



df[df.isnull().all(1)].index.values[0]


How can I find that line and skip the rest of the file?










share|improve this question














The following routine retrieves a data file.



wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")
df = pd.read_excel("C:/temp/sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')


The first three data lines are incomplete so I can slice them off like this df[3:]



At about line 1640 there is a blank line. I wish to skip the rest of the file after that line. I tried to find that line like so and get its index so I could do another slice, but I get nan for the index value.



df[df.isnull().all(1)].index.values[0]


How can I find that line and skip the rest of the file?







python pandas






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 12 '18 at 23:25









JohnJohn

1289




1289








  • 1





    Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

    – Peter Leimbigler
    Nov 12 '18 at 23:31











  • Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

    – John
    Nov 13 '18 at 0:17











  • Maybe df[df.isnull().all(1)].iloc[0].index?

    – Peter Leimbigler
    Nov 13 '18 at 0:23











  • @PeterLeimbigler It can be done without having to load the data.

    – coldspeed
    Nov 13 '18 at 0:45














  • 1





    Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

    – Peter Leimbigler
    Nov 12 '18 at 23:31











  • Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

    – John
    Nov 13 '18 at 0:17











  • Maybe df[df.isnull().all(1)].iloc[0].index?

    – Peter Leimbigler
    Nov 13 '18 at 0:23











  • @PeterLeimbigler It can be done without having to load the data.

    – coldspeed
    Nov 13 '18 at 0:45








1




1





Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

– Peter Leimbigler
Nov 12 '18 at 23:31





Try df.dropna(), which by default drops any rows that contain one or more NaN (missing) values.

– Peter Leimbigler
Nov 12 '18 at 23:31













Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

– John
Nov 13 '18 at 0:17





Unfortunately there are good rows with na values so that won't work. What I need is to find that first blank row and drop it and all after it.

– John
Nov 13 '18 at 0:17













Maybe df[df.isnull().all(1)].iloc[0].index?

– Peter Leimbigler
Nov 13 '18 at 0:23





Maybe df[df.isnull().all(1)].iloc[0].index?

– Peter Leimbigler
Nov 13 '18 at 0:23













@PeterLeimbigler It can be done without having to load the data.

– coldspeed
Nov 13 '18 at 0:45





@PeterLeimbigler It can be done without having to load the data.

– coldspeed
Nov 13 '18 at 0:45












1 Answer
1






active

oldest

votes


















1














I think you have two nan-row problems in this file:




  1. The first row after the header is already an empty row leading to a nan index.

  2. The reason for your post here, the empty row which indicates the end of the data you're interested in.


first import the data as you did it:



df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')

df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
NaN NaN NaN NaN ... NaN NaN NaN
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66


then remove first empty row (nan-index), problem No1:



df = df[1:]

df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
1987-08-07 00:00:00 0.56 0.15 0.29 ... 323.00 316.23 323.00


And now you want to index all rows before the first nan-index, problem No2.

Idea: create a boolean array with True entries for all nan- indices, cast to integer and build the cumulative sum. Now you have an array, which is 0 for all the data you want and >0 from any unwanted line on until the end.

This tested against 0 returns a boolean index for your data:



data_idx = df.index.isna().astype(int).cumsum() == 0


Applied to your dataframe:



df[data_idx]
Bullish Neutral ... Low Close
Date ...
1987-06-26 00:00:00 NaN NaN ... NaN NaN
1987-07-17 00:00:00 NaN NaN ... 307.63 314.59
1987-07-24 00:00:00 0.360000 0.500000 ... 307.81 309.27
1987-07-31 00:00:00 0.260000 0.480000 ... 310.65 318.66
1987-08-07 00:00:00 0.560000 0.150000 ... 316.23 323.00
... ... ... ... ...
2018-10-11 00:00:00 0.306061 0.339394 ... 2784.86 2785.68
2018-10-18 00:00:00 0.339350 0.310469 ... 2710.51 2809.21
2018-10-25 00:00:00 0.279693 0.310345 ... 2651.89 2656.10
2018-11-01 00:00:00 0.379310 0.275862 ... 2603.54 2711.74
2018-11-08 00:00:00 0.412844 0.275229 ... 2700.44 2813.89

[1635 rows x 12 columns]





share|improve this answer


























  • That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

    – John
    Nov 15 '18 at 18:51











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53271560%2fhow-to-skip-remainder-of-dataframe-after-blank-line%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














I think you have two nan-row problems in this file:




  1. The first row after the header is already an empty row leading to a nan index.

  2. The reason for your post here, the empty row which indicates the end of the data you're interested in.


first import the data as you did it:



df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')

df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
NaN NaN NaN NaN ... NaN NaN NaN
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66


then remove first empty row (nan-index), problem No1:



df = df[1:]

df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
1987-08-07 00:00:00 0.56 0.15 0.29 ... 323.00 316.23 323.00


And now you want to index all rows before the first nan-index, problem No2.

Idea: create a boolean array with True entries for all nan- indices, cast to integer and build the cumulative sum. Now you have an array, which is 0 for all the data you want and >0 from any unwanted line on until the end.

This tested against 0 returns a boolean index for your data:



data_idx = df.index.isna().astype(int).cumsum() == 0


Applied to your dataframe:



df[data_idx]
Bullish Neutral ... Low Close
Date ...
1987-06-26 00:00:00 NaN NaN ... NaN NaN
1987-07-17 00:00:00 NaN NaN ... 307.63 314.59
1987-07-24 00:00:00 0.360000 0.500000 ... 307.81 309.27
1987-07-31 00:00:00 0.260000 0.480000 ... 310.65 318.66
1987-08-07 00:00:00 0.560000 0.150000 ... 316.23 323.00
... ... ... ... ...
2018-10-11 00:00:00 0.306061 0.339394 ... 2784.86 2785.68
2018-10-18 00:00:00 0.339350 0.310469 ... 2710.51 2809.21
2018-10-25 00:00:00 0.279693 0.310345 ... 2651.89 2656.10
2018-11-01 00:00:00 0.379310 0.275862 ... 2603.54 2711.74
2018-11-08 00:00:00 0.412844 0.275229 ... 2700.44 2813.89

[1635 rows x 12 columns]





share|improve this answer


























  • That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

    – John
    Nov 15 '18 at 18:51
















1














I think you have two nan-row problems in this file:




  1. The first row after the header is already an empty row leading to a nan index.

  2. The reason for your post here, the empty row which indicates the end of the data you're interested in.


first import the data as you did it:



df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')

df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
NaN NaN NaN NaN ... NaN NaN NaN
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66


then remove first empty row (nan-index), problem No1:



df = df[1:]

df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
1987-08-07 00:00:00 0.56 0.15 0.29 ... 323.00 316.23 323.00


And now you want to index all rows before the first nan-index, problem No2.

Idea: create a boolean array with True entries for all nan- indices, cast to integer and build the cumulative sum. Now you have an array, which is 0 for all the data you want and >0 from any unwanted line on until the end.

This tested against 0 returns a boolean index for your data:



data_idx = df.index.isna().astype(int).cumsum() == 0


Applied to your dataframe:



df[data_idx]
Bullish Neutral ... Low Close
Date ...
1987-06-26 00:00:00 NaN NaN ... NaN NaN
1987-07-17 00:00:00 NaN NaN ... 307.63 314.59
1987-07-24 00:00:00 0.360000 0.500000 ... 307.81 309.27
1987-07-31 00:00:00 0.260000 0.480000 ... 310.65 318.66
1987-08-07 00:00:00 0.560000 0.150000 ... 316.23 323.00
... ... ... ... ...
2018-10-11 00:00:00 0.306061 0.339394 ... 2784.86 2785.68
2018-10-18 00:00:00 0.339350 0.310469 ... 2710.51 2809.21
2018-10-25 00:00:00 0.279693 0.310345 ... 2651.89 2656.10
2018-11-01 00:00:00 0.379310 0.275862 ... 2603.54 2711.74
2018-11-08 00:00:00 0.412844 0.275229 ... 2700.44 2813.89

[1635 rows x 12 columns]





share|improve this answer


























  • That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

    – John
    Nov 15 '18 at 18:51














1












1








1







I think you have two nan-row problems in this file:




  1. The first row after the header is already an empty row leading to a nan index.

  2. The reason for your post here, the empty row which indicates the end of the data you're interested in.


first import the data as you did it:



df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')

df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
NaN NaN NaN NaN ... NaN NaN NaN
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66


then remove first empty row (nan-index), problem No1:



df = df[1:]

df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
1987-08-07 00:00:00 0.56 0.15 0.29 ... 323.00 316.23 323.00


And now you want to index all rows before the first nan-index, problem No2.

Idea: create a boolean array with True entries for all nan- indices, cast to integer and build the cumulative sum. Now you have an array, which is 0 for all the data you want and >0 from any unwanted line on until the end.

This tested against 0 returns a boolean index for your data:



data_idx = df.index.isna().astype(int).cumsum() == 0


Applied to your dataframe:



df[data_idx]
Bullish Neutral ... Low Close
Date ...
1987-06-26 00:00:00 NaN NaN ... NaN NaN
1987-07-17 00:00:00 NaN NaN ... 307.63 314.59
1987-07-24 00:00:00 0.360000 0.500000 ... 307.81 309.27
1987-07-31 00:00:00 0.260000 0.480000 ... 310.65 318.66
1987-08-07 00:00:00 0.560000 0.150000 ... 316.23 323.00
... ... ... ... ...
2018-10-11 00:00:00 0.306061 0.339394 ... 2784.86 2785.68
2018-10-18 00:00:00 0.339350 0.310469 ... 2710.51 2809.21
2018-10-25 00:00:00 0.279693 0.310345 ... 2651.89 2656.10
2018-11-01 00:00:00 0.379310 0.275862 ... 2603.54 2711.74
2018-11-08 00:00:00 0.412844 0.275229 ... 2700.44 2813.89

[1635 rows x 12 columns]





share|improve this answer















I think you have two nan-row problems in this file:




  1. The first row after the header is already an empty row leading to a nan index.

  2. The reason for your post here, the empty row which indicates the end of the data you're interested in.


first import the data as you did it:



df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')

df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
NaN NaN NaN NaN ... NaN NaN NaN
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66


then remove first empty row (nan-index), problem No1:



df = df[1:]

df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
1987-08-07 00:00:00 0.56 0.15 0.29 ... 323.00 316.23 323.00


And now you want to index all rows before the first nan-index, problem No2.

Idea: create a boolean array with True entries for all nan- indices, cast to integer and build the cumulative sum. Now you have an array, which is 0 for all the data you want and >0 from any unwanted line on until the end.

This tested against 0 returns a boolean index for your data:



data_idx = df.index.isna().astype(int).cumsum() == 0


Applied to your dataframe:



df[data_idx]
Bullish Neutral ... Low Close
Date ...
1987-06-26 00:00:00 NaN NaN ... NaN NaN
1987-07-17 00:00:00 NaN NaN ... 307.63 314.59
1987-07-24 00:00:00 0.360000 0.500000 ... 307.81 309.27
1987-07-31 00:00:00 0.260000 0.480000 ... 310.65 318.66
1987-08-07 00:00:00 0.560000 0.150000 ... 316.23 323.00
... ... ... ... ...
2018-10-11 00:00:00 0.306061 0.339394 ... 2784.86 2785.68
2018-10-18 00:00:00 0.339350 0.310469 ... 2710.51 2809.21
2018-10-25 00:00:00 0.279693 0.310345 ... 2651.89 2656.10
2018-11-01 00:00:00 0.379310 0.275862 ... 2603.54 2711.74
2018-11-08 00:00:00 0.412844 0.275229 ... 2700.44 2813.89

[1635 rows x 12 columns]






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 14 '18 at 21:57

























answered Nov 14 '18 at 21:45









SpghttCdSpghttCd

4,2622313




4,2622313













  • That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

    – John
    Nov 15 '18 at 18:51



















  • That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

    – John
    Nov 15 '18 at 18:51

















That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

– John
Nov 15 '18 at 18:51





That looks like it'll solve the index problem I couldn't get around. Will test and report as soon as I get a moment.

– John
Nov 15 '18 at 18:51


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53271560%2fhow-to-skip-remainder-of-dataframe-after-blank-line%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Full-time equivalent

Bicuculline

さくらももこ