How can I get the first registry with a type A and the first registry with a type B and move it to a single...

up vote
1
down vote

favorite

I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.

Will be easy for me to demonstrate with this sample case

Id,type,Datetime 1, A, 2018-01-01 1, I, 2018-01-02 1, I, 2018-01-03 1, C, 2018-01-04 2, I, 2018-01-01 2, B, 2018-01-02 2, J, 2018-01-03 2, K, 2018-01-04 2, D, 2018-01-05 2, C, 2018-01-06 3, A, 2018-01-03 3, D, 2018-01-09 4, A, 2018-01-03 4, D, 2018-01-05 4, C, 2018-01-04

I wanted the result of the processing to be

Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate 1, A, 2018-01-01, C, 2018-01-04 2, B, 2018-01-01, D, 2018-01-05 3, A, 2018-01-03, D, 2018-01-09 4, A, 2018-01-03, C, 2018-01-04

I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0811425

What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01

The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05

I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07

1

Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11

It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14

add a comment |

up vote
1
down vote

favorite

I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.

Will be easy for me to demonstrate with this sample case

I wanted the result of the processing to be

Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate 1, A, 2018-01-01, C, 2018-01-04 2, B, 2018-01-01, D, 2018-01-05 3, A, 2018-01-03, D, 2018-01-09 4, A, 2018-01-03, C, 2018-01-04

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0811425

What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01

The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05

I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07

1

Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11

It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14

add a comment |

up vote
1
down vote

favorite

I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.

Will be easy for me to demonstrate with this sample case

I wanted the result of the processing to be

Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate 1, A, 2018-01-01, C, 2018-01-04 2, B, 2018-01-01, D, 2018-01-05 3, A, 2018-01-03, D, 2018-01-09 4, A, 2018-01-03, C, 2018-01-04

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0811425

I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.

Will be easy for me to demonstrate with this sample case

I wanted the result of the processing to be

Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate 1, A, 2018-01-01, C, 2018-01-04 2, B, 2018-01-01, D, 2018-01-05 3, A, 2018-01-03, D, 2018-01-09 4, A, 2018-01-03, C, 2018-01-04

python pandas bigdata databricks

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0811425

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0811425

edited Nov 10 at 13:12

Community♦

edited Nov 10 at 13:12

Community♦

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0811425

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0811425

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0811425

What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01

The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05

I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07

1

Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11

It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14

add a comment |

What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01

The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05

I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07

1

Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11

It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14

What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01

The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05

I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07

Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11

It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14

add a comment |

1 Answer
1

active

oldest

votes

up vote
1
down vote

accepted

First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:

df['Datetime'] = pd.to_datetime(df['Datetime'])

df = df.sort_values(['Id','Datetime'])



df1 = (df[df['type'].isin(['A','B'])]

          .drop_duplicates('Id')

          .set_index('Id')

          .rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))

df2 = (df[df['type'].isin(['C','D'])]

          .drop_duplicates('Id')

          .set_index('Id')

          .rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))

Last concat together:

df = pd.concat([df1, df2], axis=1).reset_index()

print (df)

   Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate

0   1            A  2018-01-01               C     2018-01-04

1   2            B  2018-01-02               D     2018-01-05

2   3            A  2018-01-03               D     2018-01-09

3   4            A  2018-01-03               C     2018-01-04

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

305k20239314

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

1

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53238724%2fhow-can-i-get-the-first-registry-with-a-type-a-and-the-first-registry-with-a-typ%23new-answer', 'question_page');
}
);

Post as a guest

Name

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
1
down vote

accepted

First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:

df['Datetime'] = pd.to_datetime(df['Datetime'])

df = df.sort_values(['Id','Datetime'])



df1 = (df[df['type'].isin(['A','B'])]

          .drop_duplicates('Id')

          .set_index('Id')

          .rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))

df2 = (df[df['type'].isin(['C','D'])]

          .drop_duplicates('Id')

          .set_index('Id')

          .rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))

Last concat together:

df = pd.concat([df1, df2], axis=1).reset_index()

print (df)

   Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate

0   1            A  2018-01-01               C     2018-01-04

1   2            B  2018-01-02               D     2018-01-05

2   3            A  2018-01-03               D     2018-01-09

3   4            A  2018-01-03               C     2018-01-04

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

305k20239314

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

1

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

add a comment |

up vote
1
down vote

accepted

First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:

df['Datetime'] = pd.to_datetime(df['Datetime'])

df = df.sort_values(['Id','Datetime'])



df1 = (df[df['type'].isin(['A','B'])]

          .drop_duplicates('Id')

          .set_index('Id')

          .rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))

df2 = (df[df['type'].isin(['C','D'])]

          .drop_duplicates('Id')

          .set_index('Id')

          .rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))

Last concat together:

df = pd.concat([df1, df2], axis=1).reset_index()

print (df)

   Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate

0   1            A  2018-01-01               C     2018-01-04

1   2            B  2018-01-02               D     2018-01-05

2   3            A  2018-01-03               D     2018-01-09

3   4            A  2018-01-03               C     2018-01-04

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

305k20239314

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

1

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

add a comment |

up vote
1
down vote

accepted

First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:

df['Datetime'] = pd.to_datetime(df['Datetime'])

df = df.sort_values(['Id','Datetime'])



df1 = (df[df['type'].isin(['A','B'])]

          .drop_duplicates('Id')

          .set_index('Id')

          .rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))

df2 = (df[df['type'].isin(['C','D'])]

          .drop_duplicates('Id')

          .set_index('Id')

          .rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))

Last concat together:

df = pd.concat([df1, df2], axis=1).reset_index()

print (df)

   Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate

0   1            A  2018-01-01               C     2018-01-04

1   2            B  2018-01-02               D     2018-01-05

2   3            A  2018-01-03               D     2018-01-09

3   4            A  2018-01-03               C     2018-01-04

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

305k20239314

First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:

df['Datetime'] = pd.to_datetime(df['Datetime'])

df = df.sort_values(['Id','Datetime'])



df1 = (df[df['type'].isin(['A','B'])]

          .drop_duplicates('Id')

          .set_index('Id')

          .rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))

df2 = (df[df['type'].isin(['C','D'])]

          .drop_duplicates('Id')

          .set_index('Id')

          .rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))

Last concat together:

df = pd.concat([df1, df2], axis=1).reset_index()

print (df)

   Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate

0   1            A  2018-01-01               C     2018-01-04

1   2            B  2018-01-02               D     2018-01-05

2   3            A  2018-01-03               D     2018-01-09

3   4            A  2018-01-03               C     2018-01-04

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

305k20239314

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

305k20239314

answered Nov 10 at 12:26

jezrael

305k20239314

answered Nov 10 at 12:26

jezrael

305k20239314

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

1

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

add a comment |

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

1

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Name

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Nrthugu