The Database Systems and Information Management (DIMA) Group is seeking to hire a Research Associate (PhD Student) to conduct R&D in the Optimization of Data Science Processes. The position is part of a joint initiative between DIMA and the Max Delbrück Center for Molecular Medicine (MDC), under the auspices of the HEIBRIDS (Helmholtz Einstein International Berlin Research School in Data Science) PhD program.

Applying data science methods typically involves a tedious, iterative process of specifying and executing complex data analysis pipelines. These pipelines comprise preprocessing steps, model building, and performance evaluation. Heterogeneous data sources and systems for pipeline execution often introduce complex dependencies on input data and processing infrastructure. To simplify and accelerate this data analysis process, it would be highly beneficial to enable the declarative specification of such pipelines.

To overcome these deficiencies and challenges, we propose the following research directions: (a) the design of a holistic declarative specification for data science pipelines, which addresses the aforementioned requirements of declarativity, support for different execution environments, automatic data validation, and recording of metadata; (b) the implementation of a system for the optimized execution of pipelines expressed in a declarative specification, with support for different runtimes (e.g., translation to a mixed Spark/TensorFlow workload with experiment tracking enabled, or translation to transactions inside a database with a machine learning extension); and (c) the utilization of an experiment database to automatically recommend tests for potential data errors in the pipeline (e.g., wrong data types, missing normalization of the data).


Successfully completed university degree (Master, Diplom, or equivalent) in Data Management, Distributed Systems, or Scalable Data Analysis. Applicants should be strongly motivated to work in a leading research area, develop systems, and conduct research in a real-world application setting. Ideally, applicants should possess knowledge of programming languages and compilers, data science and machine learning, the design of programming languages, and distributed programming. Experience with big data analytics systems (e.g., Apache Flink, Hadoop, or Spark), open source development, and implementing parallel database solutions will be looked upon favorably. Furthermore, good knowledge of both German and English is strongly desired. However, at a minimum, candidates must be able to communicate in English.

