How can I get the first registry with a type A and the first registry with a type B and move it to a single...











up vote
1
down vote

favorite












I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.



Will be easy for me to demonstrate with this sample case




Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04



I wanted the result of the processing to be




Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04



I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.










share|improve this question
























  • What if the C row is before the A row? Do you want this per id?
    – Willem Van Onsem
    Nov 10 at 12:01












  • The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:05










  • I want this per Id.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:07






  • 1




    Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
    – jezrael
    Nov 10 at 12:11










  • It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:14

















up vote
1
down vote

favorite












I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.



Will be easy for me to demonstrate with this sample case




Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04



I wanted the result of the processing to be




Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04



I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.










share|improve this question
























  • What if the C row is before the A row? Do you want this per id?
    – Willem Van Onsem
    Nov 10 at 12:01












  • The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:05










  • I want this per Id.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:07






  • 1




    Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
    – jezrael
    Nov 10 at 12:11










  • It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:14















up vote
1
down vote

favorite









up vote
1
down vote

favorite











I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.



Will be easy for me to demonstrate with this sample case




Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04



I wanted the result of the processing to be




Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04



I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.










share|improve this question















I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.



Will be easy for me to demonstrate with this sample case




Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04



I wanted the result of the processing to be




Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04



I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.







python pandas bigdata databricks






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 10 at 13:12









Community

11




11










asked Nov 10 at 12:00









Gabriel Monteiro Nepomuceno

1,0811425




1,0811425












  • What if the C row is before the A row? Do you want this per id?
    – Willem Van Onsem
    Nov 10 at 12:01












  • The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:05










  • I want this per Id.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:07






  • 1




    Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
    – jezrael
    Nov 10 at 12:11










  • It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:14




















  • What if the C row is before the A row? Do you want this per id?
    – Willem Van Onsem
    Nov 10 at 12:01












  • The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:05










  • I want this per Id.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:07






  • 1




    Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
    – jezrael
    Nov 10 at 12:11










  • It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:14


















What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01






What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01














The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05




The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05












I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07




I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07




1




1




Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11




Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11












It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14






It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14














1 Answer
1






active

oldest

votes

















up vote
1
down vote



accepted










First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:



df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))


Last concat together:



df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04





share|improve this answer























  • Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:34












  • @GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
    – jezrael
    Nov 10 at 12:39










  • Corrected the sample data.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:47






  • 1




    @GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
    – jezrael
    Nov 10 at 12:51











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














 

draft saved


draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53238724%2fhow-can-i-get-the-first-registry-with-a-type-a-and-the-first-registry-with-a-typ%23new-answer', 'question_page');
}
);

Post as a guest
































1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:



df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))


Last concat together:



df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04





share|improve this answer























  • Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:34












  • @GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
    – jezrael
    Nov 10 at 12:39










  • Corrected the sample data.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:47






  • 1




    @GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
    – jezrael
    Nov 10 at 12:51















up vote
1
down vote



accepted










First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:



df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))


Last concat together:



df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04





share|improve this answer























  • Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:34












  • @GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
    – jezrael
    Nov 10 at 12:39










  • Corrected the sample data.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:47






  • 1




    @GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
    – jezrael
    Nov 10 at 12:51













up vote
1
down vote



accepted







up vote
1
down vote



accepted






First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:



df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))


Last concat together:



df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04





share|improve this answer














First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:



df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))


Last concat together:



df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 10 at 12:50

























answered Nov 10 at 12:26









jezrael

305k20239314




305k20239314












  • Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:34












  • @GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
    – jezrael
    Nov 10 at 12:39










  • Corrected the sample data.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:47






  • 1




    @GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
    – jezrael
    Nov 10 at 12:51


















  • Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:34












  • @GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
    – jezrael
    Nov 10 at 12:39










  • Corrected the sample data.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:47






  • 1




    @GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
    – jezrael
    Nov 10 at 12:51
















Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34






Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34














@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39




@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39












Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47




Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47




1




1




@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51




@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51


















 

draft saved


draft discarded



















































 


draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53238724%2fhow-can-i-get-the-first-registry-with-a-type-a-and-the-first-registry-with-a-typ%23new-answer', 'question_page');
}
);

Post as a guest




















































































Popular posts from this blog

Coverage of Google Street View

Full-time equivalent

Surfing