Parsing a string into a nested data.table
I have data in a table in which one cell in every row is a multiline string, which is formatted a a bit like a document with references at the end of it. For example, one of those strings looks like:
item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz
My eventual goal is to extract foo/bar/baz into columns and count the matching items. So for the above, I'd end up with a row including:
foo | bar | baz
----+-----+----
1 | 2 | 1
I tried to start by extracting the "reference" mappings, as a nested data.table looking like this:
code | reason
-----+-------
1 | foo
2 | bar
3 | baz
Here's how I tried to do it, using data.table
and stringr
.
encounter_alerts[, whys := lapply(
str_extract_all(text, regex('^[0-9].*$', multiline = TRUE)),
FUN = function (s) { fread(text = s, sep = '=', header = FALSE, col.names = c('code', 'reason')) }
)]
I am very confused by the error message I get when I try to do this:
Error in fread(text = s, sep = "=", header = FALSE, col.names = c("code", :
file not found: 1=foo
I am explicitly using text
rather than file
so I'm not sure how it's trying to interpret the line of text as a filename!
When I test this with a single row, it seems to work fine:
> fread(text = str_extract_all(encounter_alerts[989]$text, regex('^[0-9].*$', multiline = TRUE))[[1]], sep = '=', header = FALSE, col.names = c('code', 'reason'))
code reason
1: 1 foo
2: 2 bar
What am I doing wrong? Is there a better way to do this?
Thanks!
r data.table
add a comment |
I have data in a table in which one cell in every row is a multiline string, which is formatted a a bit like a document with references at the end of it. For example, one of those strings looks like:
item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz
My eventual goal is to extract foo/bar/baz into columns and count the matching items. So for the above, I'd end up with a row including:
foo | bar | baz
----+-----+----
1 | 2 | 1
I tried to start by extracting the "reference" mappings, as a nested data.table looking like this:
code | reason
-----+-------
1 | foo
2 | bar
3 | baz
Here's how I tried to do it, using data.table
and stringr
.
encounter_alerts[, whys := lapply(
str_extract_all(text, regex('^[0-9].*$', multiline = TRUE)),
FUN = function (s) { fread(text = s, sep = '=', header = FALSE, col.names = c('code', 'reason')) }
)]
I am very confused by the error message I get when I try to do this:
Error in fread(text = s, sep = "=", header = FALSE, col.names = c("code", :
file not found: 1=foo
I am explicitly using text
rather than file
so I'm not sure how it's trying to interpret the line of text as a filename!
When I test this with a single row, it seems to work fine:
> fread(text = str_extract_all(encounter_alerts[989]$text, regex('^[0-9].*$', multiline = TRUE))[[1]], sep = '=', header = FALSE, col.names = c('code', 'reason'))
code reason
1: 1 foo
2: 2 bar
What am I doing wrong? Is there a better way to do this?
Thanks!
r data.table
add a comment |
I have data in a table in which one cell in every row is a multiline string, which is formatted a a bit like a document with references at the end of it. For example, one of those strings looks like:
item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz
My eventual goal is to extract foo/bar/baz into columns and count the matching items. So for the above, I'd end up with a row including:
foo | bar | baz
----+-----+----
1 | 2 | 1
I tried to start by extracting the "reference" mappings, as a nested data.table looking like this:
code | reason
-----+-------
1 | foo
2 | bar
3 | baz
Here's how I tried to do it, using data.table
and stringr
.
encounter_alerts[, whys := lapply(
str_extract_all(text, regex('^[0-9].*$', multiline = TRUE)),
FUN = function (s) { fread(text = s, sep = '=', header = FALSE, col.names = c('code', 'reason')) }
)]
I am very confused by the error message I get when I try to do this:
Error in fread(text = s, sep = "=", header = FALSE, col.names = c("code", :
file not found: 1=foo
I am explicitly using text
rather than file
so I'm not sure how it's trying to interpret the line of text as a filename!
When I test this with a single row, it seems to work fine:
> fread(text = str_extract_all(encounter_alerts[989]$text, regex('^[0-9].*$', multiline = TRUE))[[1]], sep = '=', header = FALSE, col.names = c('code', 'reason'))
code reason
1: 1 foo
2: 2 bar
What am I doing wrong? Is there a better way to do this?
Thanks!
r data.table
I have data in a table in which one cell in every row is a multiline string, which is formatted a a bit like a document with references at the end of it. For example, one of those strings looks like:
item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz
My eventual goal is to extract foo/bar/baz into columns and count the matching items. So for the above, I'd end up with a row including:
foo | bar | baz
----+-----+----
1 | 2 | 1
I tried to start by extracting the "reference" mappings, as a nested data.table looking like this:
code | reason
-----+-------
1 | foo
2 | bar
3 | baz
Here's how I tried to do it, using data.table
and stringr
.
encounter_alerts[, whys := lapply(
str_extract_all(text, regex('^[0-9].*$', multiline = TRUE)),
FUN = function (s) { fread(text = s, sep = '=', header = FALSE, col.names = c('code', 'reason')) }
)]
I am very confused by the error message I get when I try to do this:
Error in fread(text = s, sep = "=", header = FALSE, col.names = c("code", :
file not found: 1=foo
I am explicitly using text
rather than file
so I'm not sure how it's trying to interpret the line of text as a filename!
When I test this with a single row, it seems to work fine:
> fread(text = str_extract_all(encounter_alerts[989]$text, regex('^[0-9].*$', multiline = TRUE))[[1]], sep = '=', header = FALSE, col.names = c('code', 'reason'))
code reason
1: 1 foo
2: 2 bar
What am I doing wrong? Is there a better way to do this?
Thanks!
r data.table
r data.table
edited Nov 14 '18 at 1:53
Nicholas Riley
asked Nov 13 '18 at 3:55
Nicholas RileyNicholas Riley
37.1k586114
37.1k586114
add a comment |
add a comment |
4 Answers
4
active
oldest
votes
Note: Edited after reading comments
From your comment, I tried to reproduce what I understand your data might look like.
library(tidyverse)
df <- tibble(
strings = c("item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz",
"item A...2
item B...2
item C...3
item D...1
1=toto
2=foo
3=lala",
"item A...3
item B...3
item C...3
item D...1
1=tutu
3=ttt")
)
Code:
get_ref <- function(string) {
string %>%
str_split("n") %>%
unlist() %>%
str_subset("=") %>%
str_split_fixed("=", 2) %>%
as_tibble() %>%
rename(code = V1, reason = V2)
}
list1 <- map(df$strings, get_ref)
get_value <- function(string) {
string %>%
str_split("n") %>%
unlist() %>%
str_subset("\.\.\.") %>%
str_replace_all(".*\.\.\.", "") %>%
as_tibble() %>%
rename(code = value)
}
list2 <- map(df$strings, get_value)
get_result <- function(df1, df2) {
left_join(df1, df2) %>%
count(reason) %>%
spread(reason, n)
}
result <- map2_df(list1, list2, get_result)
result[is.na(result)] <- 0
result
Result
# A tibble: 3 x 7
bar baz foo lala toto ttt tutu
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 0 0 0 0
2 0 0 2 1 1 0 0
3 0 0 0 0 0 3 1
I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
– Nicholas Riley
Nov 13 '18 at 14:32
So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
– prosoitos
Nov 13 '18 at 16:05
Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
– prosoitos
Nov 13 '18 at 16:06
If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it topmap_df()
to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
– prosoitos
Nov 13 '18 at 16:09
I edited my answer to match what I now understand your data might look like
– prosoitos
Nov 13 '18 at 17:01
|
show 4 more comments
using stringr and dplyr you can do it easily
library(stringr)
library(dplyr)
v <- as.data.frame(c( "item A...1",
"item B...2",
"item C...3",
"item D...2"))
colnames(v)<- "items"
matching <- c( "1",
"2",
"3")
Mapping <- read.table(text="code reason
1 foo
2 bar
3 baz
", header = T)
## Answer
df1<- v %>%
mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
str(df1)
str(Mapping)
df1$code <- as.numeric(df1$code )
df1 <- left_join(df1,Mapping)
please have a look
add a comment |
There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).
sample_str <- 'item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz'
lines <- stringr::str_split(sample_str, 'n', simplify = T)
extracted_strs <- lines[stringr::str_detect(lines, '^\d=\w+$')]
dfs_list <- lapply(extracted_strs, function(x) {
str_parts <- stringr::str_split(x, '=', simplify = T)
df_args = list()
df_args[[str_parts[2]]] = as.integer(str_parts[1])
df_args[['stringsAsFactors']] = F
do.call(data.frame, df_args)
})
df <- do.call(cbind, dfs)
Thanks. This creates adata.frame
with a variable forfoo
,bar
andbar
. Would this be easier to use later on than what I created as my example?
– Nicholas Riley
Nov 13 '18 at 14:30
add a comment |
Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...
get_code_reason_mapping <- function(alert_text) {
alert_text %>%
str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
unlist() %>%
str_split_fixed("=", 2) %>%
as.data.table() %>%
setnames(c('code', 'reason'))
}
encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)
get_why_codes <- function(alert_text) {
alert_text %>%
str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
unlist() %>%
str_sub(-1) %>%
as.data.table() %>%
setnames(c('code'))
}
encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)
get_code_counts <- function(df1, df2) {
left_join(df1, df2) %>%
count(reason) %>%
spread(reason, n)
}
code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)
code_counts[is.na(code_counts)] <- 0
code_counts
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53273549%2fparsing-a-string-into-a-nested-data-table%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
Note: Edited after reading comments
From your comment, I tried to reproduce what I understand your data might look like.
library(tidyverse)
df <- tibble(
strings = c("item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz",
"item A...2
item B...2
item C...3
item D...1
1=toto
2=foo
3=lala",
"item A...3
item B...3
item C...3
item D...1
1=tutu
3=ttt")
)
Code:
get_ref <- function(string) {
string %>%
str_split("n") %>%
unlist() %>%
str_subset("=") %>%
str_split_fixed("=", 2) %>%
as_tibble() %>%
rename(code = V1, reason = V2)
}
list1 <- map(df$strings, get_ref)
get_value <- function(string) {
string %>%
str_split("n") %>%
unlist() %>%
str_subset("\.\.\.") %>%
str_replace_all(".*\.\.\.", "") %>%
as_tibble() %>%
rename(code = value)
}
list2 <- map(df$strings, get_value)
get_result <- function(df1, df2) {
left_join(df1, df2) %>%
count(reason) %>%
spread(reason, n)
}
result <- map2_df(list1, list2, get_result)
result[is.na(result)] <- 0
result
Result
# A tibble: 3 x 7
bar baz foo lala toto ttt tutu
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 0 0 0 0
2 0 0 2 1 1 0 0
3 0 0 0 0 0 3 1
I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
– Nicholas Riley
Nov 13 '18 at 14:32
So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
– prosoitos
Nov 13 '18 at 16:05
Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
– prosoitos
Nov 13 '18 at 16:06
If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it topmap_df()
to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
– prosoitos
Nov 13 '18 at 16:09
I edited my answer to match what I now understand your data might look like
– prosoitos
Nov 13 '18 at 17:01
|
show 4 more comments
Note: Edited after reading comments
From your comment, I tried to reproduce what I understand your data might look like.
library(tidyverse)
df <- tibble(
strings = c("item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz",
"item A...2
item B...2
item C...3
item D...1
1=toto
2=foo
3=lala",
"item A...3
item B...3
item C...3
item D...1
1=tutu
3=ttt")
)
Code:
get_ref <- function(string) {
string %>%
str_split("n") %>%
unlist() %>%
str_subset("=") %>%
str_split_fixed("=", 2) %>%
as_tibble() %>%
rename(code = V1, reason = V2)
}
list1 <- map(df$strings, get_ref)
get_value <- function(string) {
string %>%
str_split("n") %>%
unlist() %>%
str_subset("\.\.\.") %>%
str_replace_all(".*\.\.\.", "") %>%
as_tibble() %>%
rename(code = value)
}
list2 <- map(df$strings, get_value)
get_result <- function(df1, df2) {
left_join(df1, df2) %>%
count(reason) %>%
spread(reason, n)
}
result <- map2_df(list1, list2, get_result)
result[is.na(result)] <- 0
result
Result
# A tibble: 3 x 7
bar baz foo lala toto ttt tutu
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 0 0 0 0
2 0 0 2 1 1 0 0
3 0 0 0 0 0 3 1
I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
– Nicholas Riley
Nov 13 '18 at 14:32
So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
– prosoitos
Nov 13 '18 at 16:05
Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
– prosoitos
Nov 13 '18 at 16:06
If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it topmap_df()
to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
– prosoitos
Nov 13 '18 at 16:09
I edited my answer to match what I now understand your data might look like
– prosoitos
Nov 13 '18 at 17:01
|
show 4 more comments
Note: Edited after reading comments
From your comment, I tried to reproduce what I understand your data might look like.
library(tidyverse)
df <- tibble(
strings = c("item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz",
"item A...2
item B...2
item C...3
item D...1
1=toto
2=foo
3=lala",
"item A...3
item B...3
item C...3
item D...1
1=tutu
3=ttt")
)
Code:
get_ref <- function(string) {
string %>%
str_split("n") %>%
unlist() %>%
str_subset("=") %>%
str_split_fixed("=", 2) %>%
as_tibble() %>%
rename(code = V1, reason = V2)
}
list1 <- map(df$strings, get_ref)
get_value <- function(string) {
string %>%
str_split("n") %>%
unlist() %>%
str_subset("\.\.\.") %>%
str_replace_all(".*\.\.\.", "") %>%
as_tibble() %>%
rename(code = value)
}
list2 <- map(df$strings, get_value)
get_result <- function(df1, df2) {
left_join(df1, df2) %>%
count(reason) %>%
spread(reason, n)
}
result <- map2_df(list1, list2, get_result)
result[is.na(result)] <- 0
result
Result
# A tibble: 3 x 7
bar baz foo lala toto ttt tutu
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 0 0 0 0
2 0 0 2 1 1 0 0
3 0 0 0 0 0 3 1
Note: Edited after reading comments
From your comment, I tried to reproduce what I understand your data might look like.
library(tidyverse)
df <- tibble(
strings = c("item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz",
"item A...2
item B...2
item C...3
item D...1
1=toto
2=foo
3=lala",
"item A...3
item B...3
item C...3
item D...1
1=tutu
3=ttt")
)
Code:
get_ref <- function(string) {
string %>%
str_split("n") %>%
unlist() %>%
str_subset("=") %>%
str_split_fixed("=", 2) %>%
as_tibble() %>%
rename(code = V1, reason = V2)
}
list1 <- map(df$strings, get_ref)
get_value <- function(string) {
string %>%
str_split("n") %>%
unlist() %>%
str_subset("\.\.\.") %>%
str_replace_all(".*\.\.\.", "") %>%
as_tibble() %>%
rename(code = value)
}
list2 <- map(df$strings, get_value)
get_result <- function(df1, df2) {
left_join(df1, df2) %>%
count(reason) %>%
spread(reason, n)
}
result <- map2_df(list1, list2, get_result)
result[is.na(result)] <- 0
result
Result
# A tibble: 3 x 7
bar baz foo lala toto ttt tutu
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 0 0 0 0
2 0 0 2 1 1 0 0
3 0 0 0 0 0 3 1
edited Nov 13 '18 at 17:00
answered Nov 13 '18 at 7:16
prosoitosprosoitos
935419
935419
I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
– Nicholas Riley
Nov 13 '18 at 14:32
So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
– prosoitos
Nov 13 '18 at 16:05
Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
– prosoitos
Nov 13 '18 at 16:06
If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it topmap_df()
to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
– prosoitos
Nov 13 '18 at 16:09
I edited my answer to match what I now understand your data might look like
– prosoitos
Nov 13 '18 at 17:01
|
show 4 more comments
I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
– Nicholas Riley
Nov 13 '18 at 14:32
So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
– prosoitos
Nov 13 '18 at 16:05
Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
– prosoitos
Nov 13 '18 at 16:06
If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it topmap_df()
to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
– prosoitos
Nov 13 '18 at 16:09
I edited my answer to match what I now understand your data might look like
– prosoitos
Nov 13 '18 at 17:01
I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
– Nicholas Riley
Nov 13 '18 at 14:32
I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
– Nicholas Riley
Nov 13 '18 at 14:32
So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
– prosoitos
Nov 13 '18 at 16:05
So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
– prosoitos
Nov 13 '18 at 16:05
Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
– prosoitos
Nov 13 '18 at 16:06
Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
– prosoitos
Nov 13 '18 at 16:06
If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to
pmap_df()
to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.– prosoitos
Nov 13 '18 at 16:09
If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to
pmap_df()
to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.– prosoitos
Nov 13 '18 at 16:09
I edited my answer to match what I now understand your data might look like
– prosoitos
Nov 13 '18 at 17:01
I edited my answer to match what I now understand your data might look like
– prosoitos
Nov 13 '18 at 17:01
|
show 4 more comments
using stringr and dplyr you can do it easily
library(stringr)
library(dplyr)
v <- as.data.frame(c( "item A...1",
"item B...2",
"item C...3",
"item D...2"))
colnames(v)<- "items"
matching <- c( "1",
"2",
"3")
Mapping <- read.table(text="code reason
1 foo
2 bar
3 baz
", header = T)
## Answer
df1<- v %>%
mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
str(df1)
str(Mapping)
df1$code <- as.numeric(df1$code )
df1 <- left_join(df1,Mapping)
please have a look
add a comment |
using stringr and dplyr you can do it easily
library(stringr)
library(dplyr)
v <- as.data.frame(c( "item A...1",
"item B...2",
"item C...3",
"item D...2"))
colnames(v)<- "items"
matching <- c( "1",
"2",
"3")
Mapping <- read.table(text="code reason
1 foo
2 bar
3 baz
", header = T)
## Answer
df1<- v %>%
mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
str(df1)
str(Mapping)
df1$code <- as.numeric(df1$code )
df1 <- left_join(df1,Mapping)
please have a look
add a comment |
using stringr and dplyr you can do it easily
library(stringr)
library(dplyr)
v <- as.data.frame(c( "item A...1",
"item B...2",
"item C...3",
"item D...2"))
colnames(v)<- "items"
matching <- c( "1",
"2",
"3")
Mapping <- read.table(text="code reason
1 foo
2 bar
3 baz
", header = T)
## Answer
df1<- v %>%
mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
str(df1)
str(Mapping)
df1$code <- as.numeric(df1$code )
df1 <- left_join(df1,Mapping)
please have a look
using stringr and dplyr you can do it easily
library(stringr)
library(dplyr)
v <- as.data.frame(c( "item A...1",
"item B...2",
"item C...3",
"item D...2"))
colnames(v)<- "items"
matching <- c( "1",
"2",
"3")
Mapping <- read.table(text="code reason
1 foo
2 bar
3 baz
", header = T)
## Answer
df1<- v %>%
mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
str(df1)
str(Mapping)
df1$code <- as.numeric(df1$code )
df1 <- left_join(df1,Mapping)
please have a look
answered Nov 13 '18 at 4:05
HunaidkhanHunaidkhan
811114
811114
add a comment |
add a comment |
There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).
sample_str <- 'item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz'
lines <- stringr::str_split(sample_str, 'n', simplify = T)
extracted_strs <- lines[stringr::str_detect(lines, '^\d=\w+$')]
dfs_list <- lapply(extracted_strs, function(x) {
str_parts <- stringr::str_split(x, '=', simplify = T)
df_args = list()
df_args[[str_parts[2]]] = as.integer(str_parts[1])
df_args[['stringsAsFactors']] = F
do.call(data.frame, df_args)
})
df <- do.call(cbind, dfs)
Thanks. This creates adata.frame
with a variable forfoo
,bar
andbar
. Would this be easier to use later on than what I created as my example?
– Nicholas Riley
Nov 13 '18 at 14:30
add a comment |
There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).
sample_str <- 'item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz'
lines <- stringr::str_split(sample_str, 'n', simplify = T)
extracted_strs <- lines[stringr::str_detect(lines, '^\d=\w+$')]
dfs_list <- lapply(extracted_strs, function(x) {
str_parts <- stringr::str_split(x, '=', simplify = T)
df_args = list()
df_args[[str_parts[2]]] = as.integer(str_parts[1])
df_args[['stringsAsFactors']] = F
do.call(data.frame, df_args)
})
df <- do.call(cbind, dfs)
Thanks. This creates adata.frame
with a variable forfoo
,bar
andbar
. Would this be easier to use later on than what I created as my example?
– Nicholas Riley
Nov 13 '18 at 14:30
add a comment |
There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).
sample_str <- 'item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz'
lines <- stringr::str_split(sample_str, 'n', simplify = T)
extracted_strs <- lines[stringr::str_detect(lines, '^\d=\w+$')]
dfs_list <- lapply(extracted_strs, function(x) {
str_parts <- stringr::str_split(x, '=', simplify = T)
df_args = list()
df_args[[str_parts[2]]] = as.integer(str_parts[1])
df_args[['stringsAsFactors']] = F
do.call(data.frame, df_args)
})
df <- do.call(cbind, dfs)
There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).
sample_str <- 'item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz'
lines <- stringr::str_split(sample_str, 'n', simplify = T)
extracted_strs <- lines[stringr::str_detect(lines, '^\d=\w+$')]
dfs_list <- lapply(extracted_strs, function(x) {
str_parts <- stringr::str_split(x, '=', simplify = T)
df_args = list()
df_args[[str_parts[2]]] = as.integer(str_parts[1])
df_args[['stringsAsFactors']] = F
do.call(data.frame, df_args)
})
df <- do.call(cbind, dfs)
answered Nov 13 '18 at 5:33
Eric BurdenEric Burden
512
512
Thanks. This creates adata.frame
with a variable forfoo
,bar
andbar
. Would this be easier to use later on than what I created as my example?
– Nicholas Riley
Nov 13 '18 at 14:30
add a comment |
Thanks. This creates adata.frame
with a variable forfoo
,bar
andbar
. Would this be easier to use later on than what I created as my example?
– Nicholas Riley
Nov 13 '18 at 14:30
Thanks. This creates a
data.frame
with a variable for foo
, bar
and bar
. Would this be easier to use later on than what I created as my example?– Nicholas Riley
Nov 13 '18 at 14:30
Thanks. This creates a
data.frame
with a variable for foo
, bar
and bar
. Would this be easier to use later on than what I created as my example?– Nicholas Riley
Nov 13 '18 at 14:30
add a comment |
Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...
get_code_reason_mapping <- function(alert_text) {
alert_text %>%
str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
unlist() %>%
str_split_fixed("=", 2) %>%
as.data.table() %>%
setnames(c('code', 'reason'))
}
encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)
get_why_codes <- function(alert_text) {
alert_text %>%
str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
unlist() %>%
str_sub(-1) %>%
as.data.table() %>%
setnames(c('code'))
}
encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)
get_code_counts <- function(df1, df2) {
left_join(df1, df2) %>%
count(reason) %>%
spread(reason, n)
}
code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)
code_counts[is.na(code_counts)] <- 0
code_counts
add a comment |
Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...
get_code_reason_mapping <- function(alert_text) {
alert_text %>%
str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
unlist() %>%
str_split_fixed("=", 2) %>%
as.data.table() %>%
setnames(c('code', 'reason'))
}
encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)
get_why_codes <- function(alert_text) {
alert_text %>%
str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
unlist() %>%
str_sub(-1) %>%
as.data.table() %>%
setnames(c('code'))
}
encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)
get_code_counts <- function(df1, df2) {
left_join(df1, df2) %>%
count(reason) %>%
spread(reason, n)
}
code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)
code_counts[is.na(code_counts)] <- 0
code_counts
add a comment |
Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...
get_code_reason_mapping <- function(alert_text) {
alert_text %>%
str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
unlist() %>%
str_split_fixed("=", 2) %>%
as.data.table() %>%
setnames(c('code', 'reason'))
}
encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)
get_why_codes <- function(alert_text) {
alert_text %>%
str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
unlist() %>%
str_sub(-1) %>%
as.data.table() %>%
setnames(c('code'))
}
encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)
get_code_counts <- function(df1, df2) {
left_join(df1, df2) %>%
count(reason) %>%
spread(reason, n)
}
code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)
code_counts[is.na(code_counts)] <- 0
code_counts
Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...
get_code_reason_mapping <- function(alert_text) {
alert_text %>%
str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
unlist() %>%
str_split_fixed("=", 2) %>%
as.data.table() %>%
setnames(c('code', 'reason'))
}
encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)
get_why_codes <- function(alert_text) {
alert_text %>%
str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
unlist() %>%
str_sub(-1) %>%
as.data.table() %>%
setnames(c('code'))
}
encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)
get_code_counts <- function(df1, df2) {
left_join(df1, df2) %>%
count(reason) %>%
spread(reason, n)
}
code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)
code_counts[is.na(code_counts)] <- 0
code_counts
answered Nov 15 '18 at 2:10
Nicholas RileyNicholas Riley
37.1k586114
37.1k586114
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53273549%2fparsing-a-string-into-a-nested-data-table%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown